The empirical distribution, or empirical distribution function, can be used to describe a sample of observations of a given variable. Its value at a given point is equal to the proportion of observations from the sample that are less than or equal to that point.
The following is a formal definition.
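Definition Let $\xi_n = [x_1 \ \ldots \ x_n]$ be a sample of size $n$, where $x_1, \ldots, x_n$ are the $n$ observations from the sample. The empirical distribution function of the sample $\xi_n$ is the function $F_n : \mathbb{R} \to [0,1]$ defined as
$$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{x_i \le x\}}$$
where $1_{\{x_i \le x\}}$ is an indicator function that is equal to $1$ if $x_i \le x$ and $0$ otherwise.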
In other words, the value of the empirical distribution function at a given point $x$ is obtained by:
counting the number of observations that are less than or equal to $x$;
dividing the number thus obtained by the total number of observations $n$, so as to obtain the proportion of observations that are less than or equal to $x$.
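The two steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `ecdf_value` and the sample values are assumptions made for the example and do not come from the text.

```python
def ecdf_value(sample, x):
    """Proportion of observations in `sample` that are <= x."""
    if not sample:
        raise ValueError("sample must contain at least one observation")
    # Step 1: count the observations that are less than or equal to x.
    count = sum(1 for obs in sample if obs <= x)
    # Step 2: divide the count by the total number of observations.
    return count / len(sample)


# Hypothetical data, chosen only for illustration.
sample = [3.0, 1.0, 4.0, 2.0]
print(ecdf_value(sample, 2.5))  # two of the four observations are <= 2.5
```

The sum over the generator plays the role of the sum of indicator functions in the definition: each observation contributes 1 if it does not exceed the evaluation point and 0 otherwise.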
An example follows.
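Example Suppose we observe a sample made of four observations:
$$\xi_4 = [x_1 \ x_2 \ x_3 \ x_4]$$
What is the value of the empirical distribution function of the sample at a point $x$? According to the definition above, it is
$$F_4(x) = \frac{1}{4} \sum_{i=1}^{4} 1_{\{x_i \le x\}}$$
For instance, if exactly three of the four observations are less than or equal to $x$, then $F_4(x) = 3/4$. In other words, the proportion of observations that are less than or equal to $x$ is $3/4$.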
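Let $x_{(1)}, \ldots, x_{(n)}$ be the sample observations ordered from the smallest to the largest (in technical terms, the order statistics of the sample).
Then it is easy to see that the empirical distribution function can be written as
$$F_n(x) = \begin{cases} 0 & \text{if } x < x_{(1)} \\ i/n & \text{if } x_{(i)} \le x < x_{(i+1)}, \quad i = 1, \ldots, n-1 \\ 1 & \text{if } x \ge x_{(n)} \end{cases}$$
This is a function that is everywhere flat except at sample points. If the observations are all distinct, it jumps by $1/n$ at each sample point (a value that occurs $k$ times in the sample produces a jump of $k/n$). In the case of distinct observations, it is the distribution function of a discrete random variable that can take any one of the values $x_1, \ldots, x_n$ with probability $1/n$. In other words, it is the distribution function of a discrete variable having probability mass function
$$p(x) = \begin{cases} 1/n & \text{if } x \in \{x_1, \ldots, x_n\} \\ 0 & \text{otherwise} \end{cases}$$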
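The representation in terms of order statistics suggests a simple way to evaluate the empirical distribution function at many points: sort the sample once, then locate each evaluation point among the sorted values. The sketch below uses `bisect_right` from the Python standard library. The name `make_ecdf` and the data are illustrative assumptions.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return the empirical distribution function of `sample` as a callable."""
    ordered = sorted(sample)  # order statistics, smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must contain at least one observation")

    def ecdf(x):
        # bisect_right gives the number of ordered values that are <= x.
        return bisect_right(ordered, x) / n

    return ecdf


# Hypothetical data, chosen only for illustration.
F = make_ecdf([3.0, 1.0, 4.0, 2.0])
values = [F(x) for x in (0.5, 1.0, 1.5, 2.0, 3.5, 4.0, 10.0)]
print(values)  # flat between sample points, with a jump of 1/4 at each one
```

Because the four hypothetical observations are distinct, the function rises by exactly one quarter at each of them, is zero to the left of the smallest observation and is one from the largest observation onwards.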
When the observations from the sample $x_1, \ldots, x_n$ are the realizations of $n$ random variables $X_1, \ldots, X_n$, then the value $F_n(x)$ taken by the empirical distribution at a given point $x$ can also be regarded as a random variable. Under the hypothesis that all the random variables $X_1, \ldots, X_n$ have the same distribution, the expected value and the variance of $F_n(x)$ can be easily computed, as shown in the following proposition.
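Proposition If the observations in the sample $\xi_n = [x_1 \ \ldots \ x_n]$ are the realizations of $n$ random variables $X_1, \ldots, X_n$ having the same distribution function $F_X(x)$, then
$$\mathrm{E}[F_n(x)] = F_X(x)$$
for any $x$. Furthermore, if $X_1, \ldots, X_n$ are mutually independent, then
$$\mathrm{Var}[F_n(x)] = \frac{1}{n} F_X(x) [1 - F_X(x)]$$
for any $x$.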
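The result about the expected value is proved by using the definition of distribution function and the properties of indicator functions (in particular, the fact that the expected value of an indicator is equal to the probability of the event it indicates):
$$\mathrm{E}[F_n(x)] = \mathrm{E}\left[\frac{1}{n} \sum_{i=1}^{n} 1_{\{X_i \le x\}}\right] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{E}\left[1_{\{X_i \le x\}}\right] = \frac{1}{n} \sum_{i=1}^{n} \mathrm{P}(X_i \le x) = \frac{1}{n} \sum_{i=1}^{n} F_X(x) = F_X(x)$$
The result about the variance is proved as follows, where the second equality holds by mutual independence and the fourth uses the fact that the square of an indicator is equal to the indicator itself:
$$\mathrm{Var}[F_n(x)] = \mathrm{Var}\left[\frac{1}{n} \sum_{i=1}^{n} 1_{\{X_i \le x\}}\right] = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}\left[1_{\{X_i \le x\}}\right] = \frac{1}{n^2} \sum_{i=1}^{n} \left( \mathrm{E}\left[1_{\{X_i \le x\}}^2\right] - \mathrm{E}\left[1_{\{X_i \le x\}}\right]^2 \right) = \frac{1}{n^2} \sum_{i=1}^{n} \left( F_X(x) - F_X(x)^2 \right) = \frac{1}{n} F_X(x) [1 - F_X(x)]$$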
Thus, for any given point $x$, the empirical distribution function is an unbiased estimator of the true distribution function. Furthermore, its variance tends to zero as the sample size becomes large (as $n$ tends to infinity).
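The mean and variance results can be checked numerically with a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: the observations are drawn independently from a uniform distribution on the unit interval (so the true distribution function at a point between 0 and 1 equals the point itself), and the sample size, evaluation point, number of replications and seed are arbitrary choices.

```python
import random
import statistics


def ecdf_value(sample, x):
    """Proportion of observations in `sample` that are <= x."""
    return sum(1 for obs in sample if obs <= x) / len(sample)


def simulate_ecdf_values(n, x, replications, seed=0):
    """Draw `replications` independent samples of size n from Uniform(0, 1)
    and return the empirical distribution function of each, evaluated at x."""
    rng = random.Random(seed)
    return [
        ecdf_value([rng.random() for _ in range(n)], x)
        for _ in range(replications)
    ]


n, x = 20, 0.3  # arbitrary illustrative choices
draws = simulate_ecdf_values(n, x, replications=20_000)

true_cdf = x  # for Uniform(0, 1), the distribution function at x is x
theoretical_variance = true_cdf * (1 - true_cdf) / n

print("simulated mean:    ", statistics.fmean(draws), "theory:", true_cdf)
print("simulated variance:", statistics.pvariance(draws), "theory:", theoretical_variance)
```

The simulated mean and variance should be close to, but not exactly equal to, the theoretical values, since they are themselves estimates based on a finite number of replications.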
An immediate consequence of the previous result is that the empirical distribution converges in mean-square to the true one.
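Proposition If the observations in the sample $\xi_n = [x_1 \ \ldots \ x_n]$ are the realizations of $n$ mutually independent random variables $X_1, \ldots, X_n$ having the same distribution function $F_X(x)$, then
$$F_n(x) \xrightarrow{\text{m.s.}} F_X(x)$$
for any $x$.
We have that
$$\lim_{n \to \infty} \mathrm{E}\left[ (F_n(x) - F_X(x))^2 \right] = \lim_{n \to \infty} \mathrm{E}\left[ (F_n(x) - \mathrm{E}[F_n(x)])^2 \right] = \lim_{n \to \infty} \mathrm{Var}[F_n(x)] = \lim_{n \to \infty} \frac{1}{n} F_X(x) [1 - F_X(x)] = 0$$
where the first equality holds because $F_n(x)$ is an unbiased estimator of $F_X(x)$.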
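As a matter of fact, it is possible to prove a much stronger result, called the Glivenko-Cantelli theorem, which states that not only does $F_n(x)$ converge almost surely to $F_X(x)$ for each $x$, but it also converges uniformly, that is,
$$\sup_{x \in \mathbb{R}} \left| F_n(x) - F_X(x) \right| \xrightarrow{\text{a.s.}} 0$$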
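Uniform convergence can be illustrated, though of course not proved, by computing the largest distance between the empirical and the true distribution function for samples of increasing size. The sketch below assumes independent draws from a uniform distribution on the unit interval, for which the true distribution function at a point between 0 and 1 equals the point itself. The function name, sample sizes and seed are illustrative choices.

```python
import random


def sup_distance_uniform(sample):
    """Largest absolute gap between the empirical distribution function of
    `sample` and the Uniform(0, 1) distribution function.

    Assumes every value in `sample` lies in [0, 1]."""
    ordered = sorted(sample)
    n = len(ordered)
    # The empirical function is a step function, so the largest gap occurs
    # at a sample point: either at the jump, where the function equals
    # (i + 1) / n, or just before it, where the function equals i / n.
    return max(
        max((i + 1) / n - value, value - i / n)
        for i, value in enumerate(ordered)
    )


rng = random.Random(0)
distances = {}
for n in (100, 10_000):
    sample = [rng.random() for _ in range(n)]
    distances[n] = sup_distance_uniform(sample)
    print(n, distances[n])
```

For a single run the distance need not decrease monotonically in the sample size, but it is typically much smaller for the larger sample.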
Furthermore, the assumption that the random variables $X_1, \ldots, X_n$ be mutually independent can be relaxed (see, e.g., Borovkov 1999) to allow for some dependence among the observations (similarly to what can be done for the Law of Large Numbers; see Chebyshev's Weak Law of Large Numbers for correlated sequences).
Borovkov, A. A. (1999) Mathematical statistics, CRC Press.
Please cite as:
Taboga, Marco (2021). "Empirical distribution", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/asymptotic-theory/empirical-distribution.