The Kullback-Leibler divergence is a measure of the dissimilarity between two probability distributions.
We are going to give two separate definitions of the Kullback-Leibler (KL) divergence, one for discrete random variables and one for continuous random variables.
Definition
Let $X$ and $Y$ be two discrete random variables with supports $R_X$ and $R_Y$ and probability mass functions $p_X(x)$ and $p_Y(y)$. Let $R_X \subseteq R_Y$, so that $p_Y(x) > 0$ for every $x \in R_X$. Then the KL divergence of $p_X$ from $p_Y$ is
$$D_{\mathrm{KL}}(p_X \,\|\, p_Y) = \sum_{x \in R_X} p_X(x) \ln\!\left(\frac{p_X(x)}{p_Y(x)}\right).$$
Note that the summation is over the support of $X$, so that we always have $p_X(x) > 0$ and $p_Y(x) > 0$, and, as a consequence, the natural logarithm
$$\ln\!\left(\frac{p_X(x)}{p_Y(x)}\right)$$
is always well-defined.
The KL divergence $D_{\mathrm{KL}}(p_X \,\|\, p_Y)$ measures how much the distribution defined by $p_X$ is dissimilar from the reference distribution defined by $p_Y$.
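To make the definition concrete, here is a minimal Python sketch (our own illustration, not part of the lecture; the function name kl_divergence and the example PMFs are chosen arbitrarily) that computes the sum above over the support of $p_X$.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence of the PMF p from the reference PMF q,
    both given as arrays over a common ordered support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0   # sum only over the support of p, as in the definition
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Example: two PMFs on the support {0, 1, 2}
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))   # approximately 0.0253
print(kl_divergence(p, p))   # 0.0, since the two PMFs coincide
```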
The definition for continuous random variables is analogous.
Definition
Let $X$ and $Y$ be two continuous random variables with supports $R_X$ and $R_Y$ and probability density functions $f_X(x)$ and $f_Y(y)$ such that
$$\int_A f_X(x)\,dx > 0 \implies \int_A f_Y(x)\,dx > 0 \quad (1)$$
for any set $A$. Then the KL divergence of $f_X$ from $f_Y$ is
$$D_{\mathrm{KL}}(f_X \,\|\, f_Y) = \int_{R_X} f_X(x) \ln\!\left(\frac{f_X(x)}{f_Y(x)}\right) dx.$$
In order to be entirely rigorous, the above definition should also specify that the sets $A$ must be measurable (see the lecture on probability for a definition of measurable set).
Property (1), which is called absolute continuity, requires that if the distribution associated with the density $f_X$ assigns a non-zero probability to a set $A$, then the distribution associated with $f_Y$ must also assign a non-zero probability to that set. This requirement is analogous to that for discrete variables and ensures that the logarithm
$$\ln\!\left(\frac{f_X(x)}{f_Y(x)}\right)$$
is well-defined on all sets that have non-zero probability.
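As a sanity check of the continuous definition, the following sketch (ours, not from the lecture; the two normal densities and their parameters are arbitrary) approximates the integral numerically and compares it with the well-known closed-form expression for two Gaussian densities.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Two normal densities; both have support equal to the real line,
# so the absolute-continuity condition (1) is satisfied.
mu1, s1 = 0.0, 1.0
mu2, s2 = 1.0, 2.0
f = norm(mu1, s1).pdf
g = norm(mu2, s2).pdf

# D_KL(f || g) = integral of f(x) * ln(f(x)/g(x)) dx over the support of f.
# The limits are truncated where f is numerically negligible.
kl_numeric, _ = quad(lambda x: f(x) * np.log(f(x) / g(x)), -30, 30)

# Closed form for two Gaussian densities, used here only as a cross-check.
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_numeric, kl_exact)   # both approximately 0.4431
```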
The next proposition states a fundamental property of the Kullback-Leibler divergence.
Proposition
Let $p_X(x)$ and $p_Y(x)$ be two probability mass functions and $R_X \subseteq R_Y$. If the two probability mass functions coincide, that is, if $p_X(x) = p_Y(x)$ for all $x \in R_X$, then
$$D_{\mathrm{KL}}(p_X \,\|\, p_Y) = 0.$$
Otherwise, if they do not coincide, then
$$D_{\mathrm{KL}}(p_X \,\|\, p_Y) > 0.$$
Proof
Let us first prove the equality part. If the two probability mass functions coincide, then
$$\ln\!\left(\frac{p_X(x)}{p_Y(x)}\right) = \ln(1) = 0$$
for every $x \in R_X$, and
$$D_{\mathrm{KL}}(p_X \,\|\, p_Y) = \sum_{x \in R_X} p_X(x) \cdot 0 = 0.$$
When they do not coincide, we have
$$\begin{aligned}
D_{\mathrm{KL}}(p_X \,\|\, p_Y) &= \sum_{x \in R_X} p_X(x) \ln\!\left(\frac{p_X(x)}{p_Y(x)}\right) \\
&= -\sum_{x \in R_X} p_X(x) \ln\!\left(\frac{p_Y(x)}{p_X(x)}\right) \\
&\overset{(A)}{=} -\mathrm{E}\!\left[\ln\!\left(\frac{p_Y(X)}{p_X(X)}\right)\right] \\
&\overset{(B)}{>} -\ln\!\left(\mathrm{E}\!\left[\frac{p_Y(X)}{p_X(X)}\right]\right) \\
&= -\ln\!\left(\sum_{x \in R_X} p_X(x)\,\frac{p_Y(x)}{p_X(x)}\right) \\
&= -\ln\!\left(\sum_{x \in R_X} p_Y(x)\right) \\
&\overset{(C)}{\geq} -\ln(1) = 0
\end{aligned}$$
where: in step $(A)$ we have written the summation as an expected value with respect to the probability distribution of $X$; in step $(B)$ we have used Jensen's inequality (the function $-\ln(z)$ is strictly convex in $z$ and the random variable $p_Y(X)/p_X(X)$ is not constant because the two probability mass functions do not coincide); in step $(C)$ we have used the fact that a sum of probabilities cannot be greater than 1.
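The inequality can also be checked numerically. The sketch below (our own illustration, with arbitrarily chosen sizes and seed) draws random pairs of PMFs with full support and verifies that the divergence is strictly positive whenever they differ and zero when they coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_divergence(p, q):
    mask = p > 0   # restrict the sum to the support of p, as in the proof
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Random pairs of PMFs on 5 points, each with full support almost surely.
for _ in range(10_000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert kl_divergence(p, q) > 0   # strict inequality: p and q differ
assert kl_divergence(p, p) == 0      # equality: the PMFs coincide
print("non-negativity verified on 10,000 random pairs")
```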
A similar result holds for continuous variables.
Proposition
Let $f_X(x)$ and $f_Y(x)$ be two probability density functions such that their KL divergence is well-defined. If the two probability density functions coincide almost surely, that is, if
$$\int_A f_X(x)\,dx = \int_A f_Y(x)\,dx$$
for all measurable sets $A$, then
$$D_{\mathrm{KL}}(f_X \,\|\, f_Y) = 0.$$
Otherwise, if they do not coincide almost surely, then
$$D_{\mathrm{KL}}(f_X \,\|\, f_Y) > 0.$$
The proof is analogous to that for discrete variables.
An often-cited property of the KL divergence is that it is not symmetric, that is, in general there is no guarantee that
$$D_{\mathrm{KL}}(f_X \,\|\, f_Y) = D_{\mathrm{KL}}(f_Y \,\|\, f_X).$$
In fact, it is even possible that $D_{\mathrm{KL}}(f_X \,\|\, f_Y)$ exists when $D_{\mathrm{KL}}(f_Y \,\|\, f_X)$ is not well-defined: as you can check by looking at the definition of KL divergence, this happens when the support of $f_X$ is strictly included in the support of $f_Y$:
$$R_X \subset R_Y.$$
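Both facts are easy to reproduce numerically. The sketch below (an illustration of ours, with arbitrarily chosen PMFs) shows a pair of full-support PMFs for which the two divergences differ, and a pair for which only one direction is well-defined because the support of one PMF is strictly included in the support of the other.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence of p from the reference q; raises if not well-defined."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        raise ValueError("support of p is not included in the support of q")
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Asymmetry: both divergences exist but take different values.
p = [0.7, 0.2, 0.1]
q = [0.2, 0.3, 0.5]
print(kl_divergence(p, q), kl_divergence(q, p))   # two different values

# Strict support inclusion: only one direction is well-defined.
r = [0.8, 0.2, 0.0]          # support strictly included in that of q
print(kl_divergence(r, q))   # well-defined
try:
    kl_divergence(q, r)
except ValueError as err:
    print("not well-defined:", err)
```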
Because the Kullback-Leibler divergence is an information-theoretic concept, and most students of probability and statistics are not familiar with information theory, they often struggle to understand intuitively why the KL divergence measures the dissimilarity of a probability distribution from a reference distribution. We provide an explanation based entirely on probabilistic concepts.
Suppose that $p_X(x)$ and $p_Y(x)$ are two probability mass functions such that the KL divergence $D_{\mathrm{KL}}(p_X \,\|\, p_Y)$ is well-defined. Take a convex combination of the two distributions
$$p_c(x) = c\, p_Y(x) + (1 - c)\, p_X(x)$$
where $0 \leq c \leq 1$. By increasing $c$ we can make $p_c$ more and more similar to $p_Y$ until, when $c = 1$, $p_c$ and $p_Y$ coincide.
It is possible to prove that the KL divergence is convex (see Cover and Thomas 2006) and, as a consequence,
$$D_{\mathrm{KL}}(p_c \,\|\, p_Y) \leq c\, D_{\mathrm{KL}}(p_Y \,\|\, p_Y) + (1 - c)\, D_{\mathrm{KL}}(p_X \,\|\, p_Y) = (1 - c)\, D_{\mathrm{KL}}(p_X \,\|\, p_Y).$$
Thus, the higher $c$ is, the smaller $D_{\mathrm{KL}}(p_c \,\|\, p_Y)$ becomes. In other words, the more similar $p_c$ is to $p_Y$, the smaller the Kullback-Leibler divergence becomes.
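The following sketch (our own, with arbitrarily chosen PMFs) illustrates this behaviour: as the weight $c$ grows, the convex combination moves towards the reference distribution and the divergence shrinks towards zero.

```python
import numpy as np

def kl_divergence(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])   # plays the role of p_X
q = np.array([0.2, 0.3, 0.5])   # reference distribution, plays the role of p_Y

# Convex combinations p_c = c*q + (1 - c)*p approach q as c increases,
# and the divergence from the reference q decreases to zero.
for c in np.linspace(0.0, 1.0, 6):
    p_c = c * q + (1 - c) * p
    print(f"c = {c:.1f}   D_KL(p_c || q) = {kl_divergence(p_c, q):.4f}")
```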
Cover, T. M., and J. A. Thomas (2006) "Elements of Information Theory", Wiley-Interscience.