Maximum likelihood estimation (MLE) is a method that uses a sample to estimate the parameters of the probability distribution that generated the sample.
This lecture provides an introduction to the theory of maximum likelihood, focusing on its mathematical aspects, in particular on:
its asymptotic properties;
the assumptions that are needed to prove the properties.
At the end of the lecture, we provide links to pages that contain examples and that treat practically relevant aspects of the theory, such as numerical optimization and hypothesis testing.
The main elements of a maximum likelihood estimation problem are the following:
a sample $\xi$, that we use to make statements about the probability distribution that generated the sample;
the sample $\xi$ is regarded as the realization of a random vector $\Xi$, whose distribution is unknown and needs to be estimated;
there is a set $\Theta \subseteq \mathbb{R}^{p}$ of real vectors (called the parameter space) whose elements (called parameters) are put into correspondence with the possible distributions of $\Xi$; in particular:
if $\Xi$ is a discrete random vector, we assume that its joint probability mass function $p_{\Xi}(\xi)$ belongs to a set of joint probability mass functions $\{p_{\Xi}(\xi;\theta) : \theta \in \Theta\}$ indexed by the parameter $\theta$; when the joint probability mass function is considered as a function of $\theta$ for fixed $\xi$, it is called likelihood (or likelihood function) and it is denoted by $L(\theta;\xi) = p_{\Xi}(\xi;\theta)$;
if $\Xi$ is a continuous random vector, we assume that its joint probability density function $f_{\Xi}(\xi)$ belongs to a set of joint probability density functions $\{f_{\Xi}(\xi;\theta) : \theta \in \Theta\}$ indexed by the parameter $\theta$; when the joint probability density function is considered as a function of $\theta$ for fixed $\xi$, it is called likelihood and it is denoted by $L(\theta;\xi) = f_{\Xi}(\xi;\theta)$;
we need to estimate the true parameter $\theta_0$, which is associated with the unknown distribution that actually generated the sample (we rule out the possibility that several different parameters are put into correspondence with the true distribution).
A maximum likelihood estimator $\widehat{\theta}$ of $\theta_0$ is obtained as a solution of a maximization problem:
$$\widehat{\theta} = \operatorname*{arg\,max}_{\theta \in \Theta} L(\theta;\xi)$$
In other words, $\widehat{\theta}$ is the parameter that maximizes the likelihood of the sample $\xi$. $\widehat{\theta}$ is called the maximum likelihood estimator of $\theta_0$.
In what follows, the symbol $\widehat{\theta}$ will be used to denote both a maximum likelihood estimator (a random variable) and a maximum likelihood estimate (a realization of a random variable): the meaning will be clear from the context.
The same estimator is obtained as a solution of
$$\widehat{\theta} = \operatorname*{arg\,max}_{\theta \in \Theta} \ln L(\theta;\xi)$$
i.e., by maximizing the natural logarithm of the likelihood function. Solving this problem is equivalent to solving the original one, because the logarithm is a strictly increasing function. The logarithm of the likelihood is called log-likelihood and it is denoted by
$$l(\theta;\xi) = \ln L(\theta;\xi)$$
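To make these definitions concrete, here is a short worked example (added as an illustration; the exponential distribution is treated in detail in one of the lectures linked below). Suppose the sample $\xi = (x_1, \ldots, x_n)$ is the realization of $n$ IID draws from an exponential distribution with unknown rate $\lambda > 0$. Then
$$L(\lambda;\xi) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^{n} \exp\left(-\lambda \sum_{i=1}^{n} x_i\right), \qquad l(\lambda;\xi) = n \ln \lambda - \lambda \sum_{i=1}^{n} x_i$$
Setting the derivative of the log-likelihood to zero gives
$$\frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0 \quad\Longrightarrow\quad \widehat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i}$$
that is, the maximum likelihood estimate of the rate is the reciprocal of the sample mean.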
To derive the (asymptotic) properties of maximum likelihood estimators, one needs to specify a set of assumptions about the sample $\xi$ and the parameter space $\Theta$.
The next section presents a set of assumptions that allows us to easily derive the asymptotic properties of the maximum likelihood estimator. Some of the assumptions are quite restrictive, while others are very generic. Therefore, the subsequent sections discuss how the most restrictive assumptions can be weakened and how the most generic ones can be made more specific.
Note: the presentation in this section does not aim at being one hundred per cent rigorous. Its aim is rather to introduce the reader to the main steps that are necessary to derive the asymptotic properties of maximum likelihood estimators. Therefore, some technical details are either skipped or de-emphasized. After getting a grasp of the main issues related to the asymptotic properties of MLE, the interested reader can refer to other sources (e.g., Newey and McFadden - 1994, Ruud - 2000) for a fully rigorous presentation of MLE theory.
Let $\{X_n\}$ be a sequence of random vectors. Denote by $\xi_n$ the sample comprising the first $n$ realizations of the sequence
$$\xi_n = (x_1, \ldots, x_n)$$
which is a realization of the random vector
$$\Xi_n = (X_1, \ldots, X_n)$$
We assume that:
IID. $\{X_n\}$ is an IID sequence.
Continuous variables. A generic term $X_n$ of the sequence is a continuous random vector, whose joint probability density function $f_X(x;\theta_0)$ belongs to a set of joint probability density functions $\{f_X(x;\theta) : \theta \in \Theta\}$ indexed by a parameter $\theta$ (where we have dropped the subscript $n$ to highlight the fact that the terms of the sequence are identically distributed).
Identification. If $\theta \neq \theta_0$, then the ratio
$$\frac{f_X(X;\theta)}{f_X(X;\theta_0)}$$
is not almost surely constant. This also implies that the parametric family is identifiable: there does not exist another parameter $\theta \neq \theta_0$ such that $f_X(x;\theta)$ is the true probability density function of $X$.
Integrable log-likelihood. The log-likelihood is integrable:
$$\mathrm{E}\left[\left|\ln f_X(X;\theta)\right|\right] < \infty \quad \text{for all } \theta \in \Theta$$
Maximum. The density functions and the parameter space $\Theta$ are such that there always exists a unique solution $\widehat{\theta}_n$ of the maximization problem:
$$\widehat{\theta}_n = \operatorname*{arg\,max}_{\theta \in \Theta} L(\theta;\xi_n) = \operatorname*{arg\,max}_{\theta \in \Theta} \prod_{i=1}^{n} f_X(x_i;\theta)$$
where the rightmost equality is a consequence of independence (see the IID assumption above). Of course, this is the same as
$$\widehat{\theta}_n = \operatorname*{arg\,max}_{\theta \in \Theta} l(\theta;\xi_n) = \operatorname*{arg\,max}_{\theta \in \Theta} \sum_{i=1}^{n} \ln f_X(x_i;\theta)$$
where $l(\theta;\xi_n) = \ln L(\theta;\xi_n)$ is the log-likelihood and $\ln f_X(x_i;\theta)$ are the contributions of the individual observations to the log-likelihood. It is also the same as
$$\widehat{\theta}_n = \operatorname*{arg\,max}_{\theta \in \Theta} \frac{1}{n}\, l(\theta;\xi_n)$$
Exchangeability of limit. The density functions and the parameter space $\Theta$ are such that
$$\operatorname*{plim}_{n\rightarrow\infty}\; \operatorname*{arg\,max}_{\theta \in \Theta} \frac{1}{n}\, l(\theta;\Xi_n) = \operatorname*{arg\,max}_{\theta \in \Theta}\; \operatorname*{plim}_{n\rightarrow\infty} \frac{1}{n}\, l(\theta;\Xi_n)$$
where $\operatorname{plim}$ denotes a limit in probability. Roughly speaking, the probability limit can be brought inside the $\operatorname{arg\,max}$ operator.
Differentiability. The log-likelihood is two times continuously differentiable with respect to $\theta$ in a neighborhood of $\theta_0$.
Other technical conditions. The derivatives of the log-likelihood are well-behaved, so that it is possible to exchange integration and differentiation, their first and second moments exist, and probability limits involving their entries are well-behaved.
Given the assumptions made above, we can derive an important fact about the expected value of the log-likelihood: for any $\theta \neq \theta_0$,
$$\mathrm{E}\left[\ln f_X(X;\theta)\right] < \mathrm{E}\left[\ln f_X(X;\theta_0)\right]$$
First of all,
$$\mathrm{E}\left[\ln f_X(X;\theta)\right] - \mathrm{E}\left[\ln f_X(X;\theta_0)\right] = \mathrm{E}\left[\ln f_X(X;\theta) - \ln f_X(X;\theta_0)\right] = \mathrm{E}\left[\ln \frac{f_X(X;\theta)}{f_X(X;\theta_0)}\right]$$
Therefore, the inequality
$$\mathrm{E}\left[\ln f_X(X;\theta)\right] < \mathrm{E}\left[\ln f_X(X;\theta_0)\right]$$
is satisfied if and only if
$$\mathrm{E}\left[\ln f_X(X;\theta)\right] - \mathrm{E}\left[\ln f_X(X;\theta_0)\right] < 0$$
which can also be written as
$$\mathrm{E}\left[\ln \frac{f_X(X;\theta)}{f_X(X;\theta_0)}\right] < 0$$
(note that everything we have done so far is legitimate because we have assumed that the log-likelihoods are integrable). Thus, proving our claim is equivalent to demonstrating that this last inequality holds. In order to do this, we need to use Jensen's inequality. Since the logarithm is a strictly concave function and, by our assumptions, the ratio
$$\frac{f_X(X;\theta)}{f_X(X;\theta_0)}$$
is not almost surely constant, by Jensen's inequality we have
$$\mathrm{E}\left[\ln \frac{f_X(X;\theta)}{f_X(X;\theta_0)}\right] < \ln \mathrm{E}\left[\frac{f_X(X;\theta)}{f_X(X;\theta_0)}\right]$$
But
$$\mathrm{E}\left[\frac{f_X(X;\theta)}{f_X(X;\theta_0)}\right] = \int \frac{f_X(x;\theta)}{f_X(x;\theta_0)}\, f_X(x;\theta_0)\, dx = \int f_X(x;\theta)\, dx = 1$$
Therefore,
$$\mathrm{E}\left[\ln \frac{f_X(X;\theta)}{f_X(X;\theta_0)}\right] < \ln 1 = 0$$
which is exactly what we needed to prove.
This inequality, called information inequality by many authors, is essential for proving the consistency of the maximum likelihood estimator.
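As a quick illustrative check of the information inequality (a worked sketch using the exponential model of the earlier example, not part of the general proof), take $f_X(x;\lambda) = \lambda e^{-\lambda x}$ with true rate $\lambda_0$. Since $\mathrm{E}[X] = 1/\lambda_0$,
$$\mathrm{E}\left[\ln \frac{f_X(X;\lambda)}{f_X(X;\lambda_0)}\right] = \ln\frac{\lambda}{\lambda_0} - (\lambda - \lambda_0)\,\mathrm{E}[X] = \ln t - (t - 1), \qquad t = \frac{\lambda}{\lambda_0}$$
and $\ln t \leq t - 1$ with equality only at $t = 1$, so the expectation is strictly negative whenever $\lambda \neq \lambda_0$, as the inequality requires.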
Given the assumptions above, the maximum likelihood estimator $\widehat{\theta}_n$ is a consistent estimator of the true parameter $\theta_0$:
$$\operatorname*{plim}_{n\rightarrow\infty} \widehat{\theta}_n = \theta_0$$
where $\operatorname{plim}$ denotes a limit in probability.
We have assumed that the density functions and the parameter space are such that
$$\operatorname*{plim}_{n\rightarrow\infty}\; \operatorname*{arg\,max}_{\theta \in \Theta} \frac{1}{n}\, l(\theta;\Xi_n) = \operatorname*{arg\,max}_{\theta \in \Theta}\; \operatorname*{plim}_{n\rightarrow\infty} \frac{1}{n}\, l(\theta;\Xi_n)$$
But
$$\operatorname*{plim}_{n\rightarrow\infty} \frac{1}{n}\, l(\theta;\Xi_n) = \operatorname*{plim}_{n\rightarrow\infty} \frac{1}{n} \sum_{i=1}^{n} \ln f_X(X_i;\theta) = \mathrm{E}\left[\ln f_X(X;\theta)\right]$$
The last equality is true because, by Kolmogorov's Strong Law of Large Numbers (we have an IID sequence with finite mean), the sample average converges almost surely to $\mathrm{E}\left[\ln f_X(X;\theta)\right]$ and, therefore, it converges also in probability (convergence almost surely implies convergence in probability). Thus, putting things together, we obtain
$$\operatorname*{plim}_{n\rightarrow\infty} \widehat{\theta}_n = \operatorname*{arg\,max}_{\theta \in \Theta} \mathrm{E}\left[\ln f_X(X;\theta)\right]$$
In the proof of the information inequality (see above), we have seen that
$$\mathrm{E}\left[\ln f_X(X;\theta)\right] < \mathrm{E}\left[\ln f_X(X;\theta_0)\right] \quad \text{for all } \theta \neq \theta_0$$
which, obviously, implies
$$\theta_0 = \operatorname*{arg\,max}_{\theta \in \Theta} \mathrm{E}\left[\ln f_X(X;\theta)\right]$$
Thus,
$$\operatorname*{plim}_{n\rightarrow\infty} \widehat{\theta}_n = \theta_0$$
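The following Python sketch (an illustration added here, with arbitrary true rate, sample sizes and seed) shows the consistency result at work in the exponential model used in the examples above: the MLE $n/\sum_i x_i$ gets closer and closer to the true rate as the sample size grows.

```python
# Minimal sketch: the exponential MLE n / sum(x) approaches the true rate as n grows.
import numpy as np

rng = np.random.default_rng(42)
true_rate = 2.0

for n in (10, 100, 1_000, 10_000, 100_000):
    x = rng.exponential(scale=1 / true_rate, size=n)   # IID sample with rate true_rate
    mle = n / x.sum()                                   # analytical maximum likelihood estimate
    print(f"n = {n:>6}  MLE = {mle:.4f}")
```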
Denote by $\nabla_\theta\, l(\theta_0;\Xi_n)$ the gradient of the log-likelihood, that is, the vector of first derivatives of the log-likelihood, evaluated at the point $\theta_0$. This vector is often called the score vector.
Given the assumptions above, the score has zero expected value:
$$\mathrm{E}\left[\nabla_\theta\, l(\theta_0;\Xi_n)\right] = 0$$
First of all, note that
$$\int f_X(x;\theta)\, dx = 1$$
because probability density functions integrate to $1$. Now, taking the first derivative of both sides with respect to any component $\theta_k$ of $\theta$ and bringing the derivative inside the integral:
$$\int \frac{\partial f_X(x;\theta)}{\partial \theta_k}\, dx = 0$$
Now, multiply and divide the integrand function by $f_X(x;\theta)$:
$$\int \frac{\partial f_X(x;\theta)/\partial \theta_k}{f_X(x;\theta)}\, f_X(x;\theta)\, dx = 0$$
Since
$$\frac{\partial \ln f_X(x;\theta)}{\partial \theta_k} = \frac{\partial f_X(x;\theta)/\partial \theta_k}{f_X(x;\theta)}$$
we can write
$$\int \frac{\partial \ln f_X(x;\theta)}{\partial \theta_k}\, f_X(x;\theta)\, dx = 0$$
or, using the definition of expected value:
$$\mathrm{E}\left[\frac{\partial \ln f_X(X;\theta)}{\partial \theta_k}\right] = 0$$
which can be written in vector form using the gradient notation as
$$\mathrm{E}\left[\nabla_\theta \ln f_X(X;\theta)\right] = 0$$
This result can be used to derive the expected value of the score as follows:
$$\mathrm{E}\left[\nabla_\theta\, l(\theta_0;\Xi_n)\right] = \mathrm{E}\left[\sum_{i=1}^{n} \nabla_\theta \ln f_X(X_i;\theta_0)\right] = \sum_{i=1}^{n} \mathrm{E}\left[\nabla_\theta \ln f_X(X_i;\theta_0)\right] = 0$$
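For instance (continuing the illustrative exponential model), the score of a single observation is
$$\frac{\partial}{\partial \lambda} \ln f_X(x;\lambda) = \frac{\partial}{\partial \lambda}\left(\ln \lambda - \lambda x\right) = \frac{1}{\lambda} - x$$
and, evaluating it at the true rate $\lambda_0$ and taking expectations, $\mathrm{E}\left[1/\lambda_0 - X\right] = 1/\lambda_0 - 1/\lambda_0 = 0$, in agreement with the proposition.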
Given the assumptions above, the covariance matrix of the score (called information matrix or Fisher information matrix) is
$$\mathrm{Var}\left[\nabla_\theta\, l(\theta_0;\Xi_n)\right] = \mathrm{E}\left[\nabla_\theta\, l(\theta_0;\Xi_n)\, \nabla_\theta\, l(\theta_0;\Xi_n)^{\top}\right] = -\mathrm{E}\left[H_{l}(\theta_0;\Xi_n)\right]$$
where $H_{l}(\theta_0;\Xi_n)$ is the Hessian of the log-likelihood, that is, the matrix of second derivatives of the log-likelihood, evaluated at the point $\theta_0$.
From the previous proof, we know that
$$\int \frac{\partial \ln f_X(x;\theta)}{\partial \theta_k}\, f_X(x;\theta)\, dx = 0$$
Now, taking the first derivative of both sides with respect to any component $\theta_j$ of $\theta$ (and using again the fact that $\partial f_X/\partial \theta_j = (\partial \ln f_X/\partial \theta_j)\, f_X$), we obtain
$$\int \frac{\partial^2 \ln f_X(x;\theta)}{\partial \theta_k\, \partial \theta_j}\, f_X(x;\theta)\, dx + \int \frac{\partial \ln f_X(x;\theta)}{\partial \theta_k}\, \frac{\partial \ln f_X(x;\theta)}{\partial \theta_j}\, f_X(x;\theta)\, dx = 0$$
Rearranging, we get
$$\mathrm{E}\left[\frac{\partial \ln f_X(X;\theta)}{\partial \theta_k}\, \frac{\partial \ln f_X(X;\theta)}{\partial \theta_j}\right] = -\mathrm{E}\left[\frac{\partial^2 \ln f_X(X;\theta)}{\partial \theta_k\, \partial \theta_j}\right]$$
Since this is true for any $\theta_k$ and any $\theta_j$, we can express it in matrix form as
$$\mathrm{E}\left[\nabla_\theta \ln f_X(X;\theta_0)\, \nabla_\theta \ln f_X(X;\theta_0)^{\top}\right] = -\mathrm{E}\left[H_{\ln f}(X;\theta_0)\right]$$
where the left-hand side is the covariance matrix of the gradient (which has zero mean, as shown above) and $H_{\ln f}(X;\theta_0)$ is the Hessian of $\ln f_X(X;\theta)$ evaluated at $\theta_0$. This result is equivalent to the result we need to prove because, by independence and additivity of the log-likelihood,
$$\mathrm{Var}\left[\nabla_\theta\, l(\theta_0;\Xi_n)\right] = \sum_{i=1}^{n} \mathrm{Var}\left[\nabla_\theta \ln f_X(X_i;\theta_0)\right] \quad\text{and}\quad \mathrm{E}\left[H_{l}(\theta_0;\Xi_n)\right] = \sum_{i=1}^{n} \mathrm{E}\left[H_{\ln f}(X_i;\theta_0)\right]$$
The latter equality is often called information equality.
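Continuing the same illustrative exponential model, the information equality can be verified directly: the score of one observation is $1/\lambda - x$, so its variance at $\lambda_0$ is $\mathrm{Var}[X] = 1/\lambda_0^2$, while the second derivative of $\ln f_X(x;\lambda) = \ln \lambda - \lambda x$ is $-1/\lambda^2$, so that
$$-\mathrm{E}\left[\frac{\partial^2 \ln f_X(X;\lambda_0)}{\partial \lambda^2}\right] = \frac{1}{\lambda_0^2} = \mathrm{Var}\left[\frac{\partial \ln f_X(X;\lambda_0)}{\partial \lambda}\right]$$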
The maximum likelihood estimator is asymptotically normal:
$$\sqrt{n}\left(\widehat{\theta}_n - \theta_0\right) \xrightarrow{d} N\left(0, V\right), \qquad V = \left[\mathrm{E}\left(\nabla_\theta \ln f_X(X;\theta_0)\, \nabla_\theta \ln f_X(X;\theta_0)^{\top}\right)\right]^{-1}$$
In other words, the distribution of the maximum likelihood estimator $\widehat{\theta}_n$ can be approximated by a multivariate normal distribution with mean $\theta_0$ and covariance matrix
$$\frac{1}{n}\, V$$
Denote by
$$\nabla_\theta\, l(\theta;\Xi_n)$$
the gradient of the log-likelihood, i.e., the vector of first derivatives of the log-likelihood. Denote by
$$H_{l}(\theta;\Xi_n)$$
the Hessian of the log-likelihood, i.e., the matrix of second derivatives of the log-likelihood. Since the maximum likelihood estimator $\widehat{\theta}_n$ maximizes the log-likelihood, it satisfies the first order condition
$$\nabla_\theta\, l(\widehat{\theta}_n;\Xi_n) = 0$$
Furthermore, by the Mean Value Theorem, we have
$$\nabla_\theta\, l(\widehat{\theta}_n;\Xi_n) = \nabla_\theta\, l(\theta_0;\Xi_n) + H_{l}(\bar{\theta};\Xi_n)\left(\widehat{\theta}_n - \theta_0\right)$$
where, for each $k$, the intermediate points $\bar{\theta}_k$ satisfy
$$\left\| \bar{\theta}_k - \theta_0 \right\| \leq \left\| \widehat{\theta}_n - \theta_0 \right\|$$
and the notation
$$H_{l}(\bar{\theta};\Xi_n)$$
indicates that each row of the Hessian is evaluated at a different point (row $k$ is evaluated at the point $\bar{\theta}_k$). Substituting the first order condition in the mean value equation, we obtain
$$0 = \nabla_\theta\, l(\theta_0;\Xi_n) + H_{l}(\bar{\theta};\Xi_n)\left(\widehat{\theta}_n - \theta_0\right)$$
which, by solving for $\widehat{\theta}_n - \theta_0$, becomes
$$\widehat{\theta}_n - \theta_0 = -\left[H_{l}(\bar{\theta};\Xi_n)\right]^{-1} \nabla_\theta\, l(\theta_0;\Xi_n)$$
which can be rewritten as
$$\sqrt{n}\left(\widehat{\theta}_n - \theta_0\right) = \left[-\frac{1}{n} H_{l}(\bar{\theta};\Xi_n)\right]^{-1} \left[\frac{1}{\sqrt{n}} \nabla_\theta\, l(\theta_0;\Xi_n)\right]$$
We will show that the term in the first pair of square brackets converges in probability to a constant, invertible matrix and that the term in the second pair of square brackets converges in distribution to a normal distribution. The consequence will be that their product also converges in distribution to a normal distribution (by Slutsky's theorem).
As far as the first term is concerned, note that the intermediate points $\bar{\theta}_k$ converge in probability to $\theta_0$:
$$\operatorname*{plim}_{n\rightarrow\infty} \bar{\theta}_k = \theta_0$$
because $\left\|\bar{\theta}_k - \theta_0\right\| \leq \left\|\widehat{\theta}_n - \theta_0\right\|$ and $\widehat{\theta}_n$ is consistent. Therefore, skipping some technical details, we get
$$\operatorname*{plim}_{n\rightarrow\infty} \left[-\frac{1}{n} H_{l}(\bar{\theta};\Xi_n)\right] = \operatorname*{plim}_{n\rightarrow\infty} \left[-\frac{1}{n} \sum_{i=1}^{n} H_{\ln f}(X_i;\theta_0)\right] = -\mathrm{E}\left[H_{\ln f}(X;\theta_0)\right]$$
As far as the second term is concerned, we get
$$\frac{1}{\sqrt{n}} \nabla_\theta\, l(\theta_0;\Xi_n) = \sqrt{n}\, \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta \ln f_X(X_i;\theta_0) \xrightarrow{d} N\left(0,\ \mathrm{E}\left[\nabla_\theta \ln f_X(X;\theta_0)\, \nabla_\theta \ln f_X(X;\theta_0)^{\top}\right]\right)$$
by the Central Limit Theorem, because the individual scores are IID with zero mean and finite covariance matrix. By putting things together and using the Continuous Mapping Theorem and Slutsky's theorem (see also the exercises in the lecture on Slutsky's theorem), we obtain
$$\sqrt{n}\left(\widehat{\theta}_n - \theta_0\right) \xrightarrow{d} N\left(0,\ \left[-\mathrm{E}\left(H_{\ln f}(X;\theta_0)\right)\right]^{-1} \mathrm{E}\left[\nabla_\theta \ln f_X(X;\theta_0)\, \nabla_\theta \ln f_X(X;\theta_0)^{\top}\right] \left[-\mathrm{E}\left(H_{\ln f}(X;\theta_0)\right)\right]^{-1}\right)$$
By the information equality (see its proof), we have $\mathrm{E}\left[\nabla_\theta \ln f_X(X;\theta_0)\, \nabla_\theta \ln f_X(X;\theta_0)^{\top}\right] = -\mathrm{E}\left[H_{\ln f}(X;\theta_0)\right]$, so the expression above simplifies and the asymptotic covariance matrix is equal to the inverse of the negative of the expected value of the Hessian matrix:
$$V = \left[-\mathrm{E}\left(H_{\ln f}(X;\theta_0)\right)\right]^{-1} = \left[\mathrm{E}\left(\nabla_\theta \ln f_X(X;\theta_0)\, \nabla_\theta \ln f_X(X;\theta_0)^{\top}\right)\right]^{-1}$$
which is the covariance matrix appearing in the statement of the proposition.
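The Python sketch below (an illustration with arbitrary constants and seed) checks the asymptotic normality result by simulation in the same exponential model; there the per-observation information is $1/\lambda_0^2$, so $V = \lambda_0^2$ and the standard deviation of $\sqrt{n}\,(\widehat{\lambda}_n - \lambda_0)$ should be close to $\lambda_0$.

```python
# Minimal sketch: sqrt(n) * (MLE - true rate) is approximately N(0, true_rate**2).
import numpy as np

rng = np.random.default_rng(0)
true_rate, n, replications = 2.0, 2_000, 5_000

estimates = np.empty(replications)
for r in range(replications):
    x = rng.exponential(scale=1 / true_rate, size=n)
    estimates[r] = n / x.sum()                          # MLE of the rate

z = np.sqrt(n) * (estimates - true_rate)
print("simulated std. dev.:  ", z.std())                # should be close to true_rate = 2.0
print("theoretical std. dev.:", true_rate)
```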
As previously mentioned, some of the assumptions made above are quite restrictive, while others are very generic. We now discuss how the former can be weakened and how the latter can be made more specific.
Assumption 1 (IID). It is possible to relax the assumption that is IID and allow for some dependence among the terms of the sequence (see, e.g., Bierens - 2004 for a discussion). In case dependence is present, the formula for the asymptotic covariance matrix of the MLE given above is no longer valid and needs to be replaced by a formula that takes serial correlation into account.
Assumption 2 (continuous variables). It is possible to prove consistency and asymptotic normality also when the terms of the sequence are extracted from a discrete distribution, or from a distribution that is neither discrete nor continuous (see, e.g., Newey and McFadden - 1994).
Assumption 3 (identification). Typically, different identification conditions are needed when the IID assumption is relaxed (e.g., Bierens - 2004).
Assumption 5 (maximum). To ensure the existence of a maximum, requirements are typically imposed both on the parameter space and on the log-likelihood function. For example, it can be required that the parameter space be compact (closed and bounded) and the log-likelihood function be continuous. Also, the parameter space can be required to be convex and the log-likelihood function strictly concave (e.g.: Newey and McFadden - 1994).
Assumption 6 (exchangeability of limit). To ensure the exchangeability of the limit and the $\operatorname{arg\,max}$ operator, the following condition (uniform convergence in probability of the sample average of the log-likelihood contributions to its expected value) is often imposed:
$$\sup_{\theta \in \Theta} \left| \frac{1}{n}\, l(\theta;\Xi_n) - \mathrm{E}\left[\ln f_X(X;\theta)\right] \right| \xrightarrow{p} 0$$
Assumption 8 (other technical conditions). See, for example, Newey and McFadden (1994) for a discussion of these technical conditions.
In some cases, the maximum likelihood problem has an analytical solution. That is, it is possible to write the maximum likelihood estimator explicitly as a function of the data.
However, in many cases there is no explicit solution. In these cases, numerical optimization algorithms are used to maximize the log-likelihood. The lecture entitled Maximum likelihood - Algorithm discusses these algorithms.
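As a rough illustration of this numerical approach (a sketch written for this page, not taken from the lecture mentioned above), the following Python code estimates the degrees of freedom of a standard t distribution, a problem without a closed-form solution; the simulated data, starting value and optimizer are assumptions of the sketch.

```python
# Minimal sketch: numerical maximum likelihood for the degrees of freedom of a
# standard t distribution (an illustrative analogue of the MATLAB example linked below).
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.standard_t(df=5.0, size=1000)          # sample with true degrees of freedom 5

def negative_log_likelihood(params):
    df = params[0]
    if df <= 0:
        return np.inf                           # keep the optimizer inside the parameter space
    return -np.sum(stats.t.logpdf(x, df))       # minus the log-likelihood

result = minimize(negative_log_likelihood, x0=[10.0], method="Nelder-Mead")
print("estimated degrees of freedom:", result.x[0])
```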
The following lectures provide detailed examples of how to derive analytically the maximum likelihood (ML) estimators and their asymptotic variance:
ML estimation of the parameter of the exponential distribution
ML estimation of the parameters of the multivariate normal distribution
ML estimation of the parameters of a normal linear regression model
The following lectures provide examples of how to perform maximum likelihood estimation numerically:
ML estimation of the degrees of freedom of a standard t distribution (MATLAB example)
ML estimation of the coefficients of a logistic classification model
ML estimation of the coefficients of a probit classification model
The following sections contain more details about the theory of maximum likelihood estimation.
Methods to estimate the asymptotic covariance matrix of maximum likelihood estimators, including OPG, Hessian and Sandwich estimators, are discussed in the lecture entitled Maximum likelihood - Covariance matrix estimation.
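As a preview of those estimators, here is a minimal numerical sketch, under the assumption that the per-observation scores and Hessians of the log-likelihood, evaluated at the maximum likelihood estimate, are already available as NumPy arrays S (n x p) and H (n x p x p); the function names are invented for this illustration.

```python
# Minimal sketch of the three classical estimates of the asymptotic covariance matrix V,
# given per-observation scores S (n x p) and Hessians H (n x p x p) of the log-likelihood,
# both evaluated at the maximum likelihood estimate.
import numpy as np

def opg_estimate(S):
    n = S.shape[0]
    return np.linalg.inv(S.T @ S / n)          # inverse of the outer product of gradients

def hessian_estimate(H):
    return np.linalg.inv(-H.mean(axis=0))      # inverse of minus the average Hessian

def sandwich_estimate(S, H):
    n = S.shape[0]
    bread = np.linalg.inv(H.mean(axis=0))      # inverse average Hessian ("bread")
    meat = S.T @ S / n                          # outer product of gradients ("meat")
    return bread @ meat @ bread.T               # robust (sandwich) estimate

# In each case, the estimated covariance matrix of the MLE itself is the result divided by n.
```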
Tests of hypotheses on parameters estimated by maximum likelihood are discussed in the lecture entitled Maximum likelihood - Hypothesis testing, as well as in the lectures on the three classical tests: the Wald test, the likelihood ratio test, and the Lagrange multiplier (score) test.
Bierens, H. J. (2004) Introduction to the mathematical and statistical foundations of econometrics, Cambridge University Press.
Newey, W. K. and D. McFadden (1994) "Chapter 35: Large sample estimation and hypothesis testing", in Handbook of Econometrics, Elsevier.
Ruud, P. A. (2000) An introduction to classical econometric theory, Oxford University Press.
Please cite as:
Taboga, Marco (2021). "Maximum likelihood estimation", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/maximum-likelihood.