The empirical distribution, or empirical distribution function, can be used to describe a sample of observations of a given variable. Its value at a given point is equal to the proportion of observations from the sample that are less than or equal to that point.
The following is a formal definition.
Definition
Letbe
a sample of size
,
where
,...,
are the
observations from the sample. The empirical distribution function of the
sample
is the function
defined
as
where
is an indicator function that is equal to
if
and
otherwise.
In other words, the value of the empirical distribution function at a given
point
is obtained by:
counting the number of observations that are less than or equal to
;
dividing the number thus obtained by the total number of observations, so as
to obtain the proportion of observations that is less than or equal to
.
An example follows.
Example
Suppose we observe a sample made of four
observations:where
What
is the value of the empirical distribution function of the sample
at the point
?
According to the definition above, it
is
In
other words, the proportion of observations that are less than or equal to
is
.
Let
,...,
be the sample observations ordered from the smallest to the largest (in
technical terms, the order statistics of the sample).
Then it is easy to see that the empirical distribution function can be written
asThis
is a function that is everywhere flat except at sample points, where it jumps
by
.
It is the distribution
function of a discrete random variable
that can take any one of the values
,...,
with probability
.
In other words, it is the distribution function of a discrete variable
having probability mass
function
When the
observations from the sample
,...,
are the
realizations of
random variables
,...,
,
then the value
taken by the empirical distribution at a given point
can also be regarded as a random variable. Under the hypothesis that all the
random variables
,...,
have the same distribution, the expected value and the variance of
can be easily computed, as shown in the following proposition.
Proposition
If the
observations in the
sample
are
the realizations of
random variables
,...,
having the same distribution function
,
then
for
any
.
Furthermore, if
,...,
are
mutually
independent,
then
for
any
.
The result about the expected value is
proved by using the definition of distribution function and the properties of
indicator
functions (in particular, the fact that the expected value of an indicator
is equal to the probability of the event it
indicates):The
result about the variance is proved as
follows:
Thus, for any given point, the empirical distribution function is an unbiased
estimator of the true distribution function. Furthermore, its variance tends
to zero as the sample size becomes large (as
tends to infinity).
An immediate consequence of the previous result is that the empirical distribution converges in mean-square to the true one.
Proposition
If the
observations in the
sample
are
the realizations of
mutually independent random variables
,...,
having the same distribution function
,
then
for
any
.
We have
that
As a matter of fact, it is possible to prove a much stronger result, called
Glivenko-Cantelli theorem, which states that not only
converges almost surely to
for each
,
but it also converges uniformly, that
is,
Furthermore, the assumption that the random variables
,...,
be mutually independent can be relaxed (see, e.g., Borokov 1999) to allow for
some dependence among the observations (similarly to what can be done for the
Law of Large Numbers; see Chebyshev's Weak Law
of Large Numbers for correlated sequences).
Borokov, A. A. (1999) Mathematical statistics, CRC Press.
Please cite as:
Taboga, Marco (2021). "Empirical distribution", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/asymptotic-theory/empirical-distribution.
Most of the learning materials found on this website are now available in a traditional textbook format.