Large language models (LLMs) like ChatGPT have brought probabilistic models into mainstream conversation. Unlike deterministic models that always produce the same output for a given input, these models generate a probability distribution over many possible outputs. Information theory provides the mathematical framework for understanding these distributions, and how sensitive they are to the inputs of the model.
To understand this more concretely, consider what happens when you prompt an LLM. You might ask “What’s the capital of France?” and reliably get “Paris” as the answer. The model is highly confident: the output probability distribution is concentrated over the one-word output “Paris”. But if you ask “Write me a poem about autumn,” you’ll get a different response each time. Here, the probability distribution is spread across countless possible poems, each with some small probability of being generated. In the first case, the model is certain, reflected in the strongly peaked output distribution, while in the latter case it is uncertain, reflected in the highly dispersed output distribution. The input to the model (i.e., the prompt) thus has a strong effect on the certainty in the output distribution.
The gap between certainty and uncertainty lies at the heart of information theory. When Claude Shannon founded the field (see Shannon 1948), he was grappling with a fundamental question in telecommunications: how much data can you reliably transmit through a noisy communication channel? Telephone wires introduce random distortions like static and interference, scrambling the signal in unpredictable ways. Shannon realized he could model this mathematically using probability theory, and what began as a framework for analyzing communication over literal telephone wires evolved into an abstract mathematical theory with remarkably broad applications. Indeed, Shannon has been credited with laying the foundations for our modern information-based society.
Interestingly, Shannon’s notion of entropy has deep connections to the concept of entropy in statistical mechanics and thermodynamics. In 1957, E. T. Jaynes famously formalized this connection in his influential paper (Jaynes 1957), where he wrote:
“The mere fact that the same mathematical expression \(-\sum p_i \log{p_i}\) [for entropy] occurs both in statistical mechanics and in information theory does not in itself establish any connection between these fields. This can be done only by finding new viewpoints from which thermodynamic entropy and information-theory entropy appear as the same concept. In this paper we suggest a reinterpretation of statistical mechanics which accomplishes this, so that information theory can be applied to the problem of justification of statistical mechanics.”
Even my undergraduate thermodynamics textbook devoted an entire chapter to Shannon’s information theory, emphasizing how these mathematical ideas provide a unifying perspective across seemingly different domains.
We will begin by surveying the most basic quantities of information theory: surprisal, entropy, Kullback–Leibler (KL) divergence, and mutual information. Surprisal is a quantity attached to a random outcome that is inversely linked to its probability: a likely outcome is not surprising, while an unlikely one is highly surprising. When we average surprisal over all possible random outcomes, we obtain the entropy, which quantifies the overall uncertainty in a probability distribution. The KL divergence measures how one probability distribution differs from another, capturing the “distance” between them. Mutual information can be viewed as a special kind of KL divergence applied to two random variables, \(X\) and \(Y\): it measures how much observing \(X\) reduces the uncertainty (entropy) in \(Y\) on average. This last quantity connects directly to our discussion of LLMs: it captures precisely how much the input prompt narrows down the output distribution.
This post is the first in a series on information. In future posts, we will explore other ways the concept of information appears—for example, through \(\sigma\)-algebras—and apply these ideas to a range of problems, from gambling strategies and games of chance (the historical origin of mathematical probability theory), to options pricing in mathematical finance, and to probabilistic models in machine learning. I have discussed information theory previously in a chapter of my book; while some material overlaps with that chapter, this series also introduces many new perspectives and examples.
Flows of information
We begin by building a mathematical gadget—a kind of probabilistic framework—that models the “flow of information” between two random variables \(X\) and \(Y\) (or random vectors, or random objects, or …). Such flows are exactly what information theory calls communication channels, and they include many of the predictive probabilistic models in machine learning where information flows from input \(X\) to output \(Y\). Once these flows have been identified in this section, in the rest of the post we seek to understand how information flowing from \(X\) influences and shapes the distribution of \(Y\).
The simplest flow between \(X\) and \(Y\) is a functional one, expressed as an equation \[
g(X)=Y,
\tag{1}\]
where \(g\) is a function. With \(X\) as input and \(Y\) as output, each \(X=x\) produces a unique output \(y = g(x)\). Such flows underlie deterministic models. In the case that \(X\) and \(Y\) take values on the real line \(\mathbb{R}\), we might visualize the situation like this:
Note that each value of \(x\) along the input (left) axis determines a unique value of \(y\) along the output (right) axis.
On the other hand, we might suppose that information flows from \(X\) to \(Y\) in a stochastic fashion, in which \(X=x\) no longer determines a single \(y\), but instead induces a distribution over possible \(Y\) values. This is precisely what a conditional distribution \(P(Y= y\mid X=x)\) captures: given an observed value \(X=x\), we have a probability distribution on \(y\)’s. We can think of this as a function of the form
\[
x \mapsto P(Y= y \mid X=x),
\tag{2}\]
where \(y\) plays the role of a variable rather than a fixed quantity, so that \(P(Y= y \mid X=x)\) is a probability distribution and not just a single probability. So this function is rather special: its input is a value \(x\), while its output is an entire probability distribution. Mathematicians call such objects Markov kernels. A figure will make this more concrete: in contrast to the deterministic flows drawn above, in the case that \(X\) and \(Y\) take values on the real line \(\mathbb{R}\), we visualize a stochastic flow as follows, where each value of \(x\) is mapped to a probability distribution on \(y\)’s:
In our model of a stochastic flow, we therefore use conditional distributions, understood as functions on the conditioning variable.
In practice, we often model such flows with a family \(P(Y=y;\theta)\) of distributions parameterized by a vector \(\theta\). The stochastic flow from \(X\) to \(Y\) is then implemented as a function \(x\mapsto \theta(x)\) from observations of \(X\) to parameters \(\theta\), and the conditional distribution is then defined as
\[
P(Y=y \mid X=x) = P(Y=y; \theta(x)).
\]
A familiar example is the linear regression model, in which \(Y\mid X=x\) is normally distributed with mean \(\theta(x) = \beta_0 + \beta_1 x\) and fixed variance \(\sigma^2\), for some model coefficients \(\beta_0\) and \(\beta_1\). Concisely, the stochastic flow from \(X\) to \(Y\) in a linear regression model is completely described by specifying
\[
Y \mid X=x \sim \mathcal{N}(\beta_0 + \beta_1 x, \sigma^2).
\]
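To make this concrete, here is a minimal sketch of such a flow in Python: the function theta implements the map \(x\mapsto\theta(x)\), and kernel returns the corresponding conditional distribution object (a Markov kernel in code). The coefficient values beta0, beta1 and the noise scale sigma are hypothetical, chosen only for illustration.

import numpy as np
from scipy.stats import norm

# Hypothetical model coefficients and noise scale (for illustration only)
beta0, beta1, sigma = 1.0, 2.0, 0.5

def theta(x):
    # Map an observation X = x to the parameters of the conditional distribution
    return beta0 + beta1 * x, sigma

def kernel(x):
    # Markov kernel: each x is mapped to an entire distribution over y's
    mu, scale = theta(x)
    return norm(loc=mu, scale=scale)

# The same input always yields the same *distribution*, but sampling from it is stochastic
rng = np.random.default_rng(0)
dist = kernel(2.0)                          # conditional distribution of Y given X = 2
print(dist.mean(), dist.std())              # 5.0 0.5
print(dist.rvs(size=3, random_state=rng))   # three random draws of Y given X = 2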
We will return to an information-theoretic treatment of linear regression (and other) models in a later post.
For now, let’s see all this in action with real distributions in a real-world context. Suppose that \(X\) is the number of hours that a randomly chosen student studies for an upcoming exam, restricted to discrete values \(X = 1,2,3,4,5,6\). Let’s say that most students study only 2 or 3 hours; the full distribution of \(X\), with mass function \(f(x)\), is shown below:
Code
# Set the number of possible study hours (1 through 6)
n = 6
# Create a sequence of possible values for X (hours studied)
xs = range(1, n + 1)
# Compute the probability mass function for X using a Poisson distribution with mean 3
fxs = poisson.pmf(xs, mu=3)
# Normalize the probabilities so they sum to 1 (ensures a valid probability distribution)
fxs /= fxs.sum()
# Create a new matplotlib figure and axis with a specified size
_, ax = plt.subplots(figsize=(6, 4))
# Plot a bar chart of the probability mass function for X
ax.bar(xs, fxs, width=0.4, zorder=2)
# Label the x-axis as "hours studied (x)"
ax.set_xlabel(r"hours studied ($x$)")
# Label the y-axis as "probability"
ax.set_ylabel("probability")
# Set the plot title to indicate this is the marginal mass function f(x)
ax.set_title(r"marginal mass $f(x)$")
# Set the x-axis ticks to match the possible values of X
ax.set_xticks(range(1, n + 1))
# Adjust layout for better appearance and display the plot
plt.tight_layout()
plt.show()
We might reasonably believe that \(X\) is predictive of \(Y\), the exam score of a randomly chosen student, taking continuous values in the interval \([0,1]\), understood as percentages. The corresponding marginal density \(f(y)\) is shown below:
Code
# Define the marginal density function fy(y) for exam scores.
# This is a mixture of Beta distributions, weighted by the probability mass fxs for each x.
def fy(y):
    return sum([beta.pdf(y, a=x, b=3) * fx for x, fx in zip(xs, fxs)])

# Create a new matplotlib figure and axis with a specified size.
_, ax = plt.subplots(figsize=(6, 4))
# Create a grid of y values between 0 and 1 for plotting the density.
grid = np.linspace(0, 1, num=250)
# Plot the marginal density fy(y) over the grid.
ax.plot(grid, fy(grid))
# Shade the area under the density curve for visual emphasis.
ax.fill_between(grid, fy(grid), zorder=2, alpha=0.1)
# Format the x-axis labels as percentages (since y is a proportion).
ax.xaxis.set_major_formatter(PercentFormatter(xmax=1))
# Set the plot title and axis labels.
ax.set_title(r"marginal density $f(y)$")
ax.set_xlabel("test score ($y$)")
ax.set_ylabel("probability density")
# Adjust layout for better appearance and display the plot.
plt.tight_layout()
plt.show()
Together, \(X\) and \(Y\) have a joint mass/density function \(f(x,y)\), visualized in the following ridgeline plot, where each of the horizontal density curves shows \(f(x,y)\) as a function of \(y\), for fixed \(x=1,2,3,4,5,6\).
Code
# Create a new matplotlib figure and axis with a specified size
_, ax = plt.subplots(figsize=(6, 5))
# Generate a list of colors for each value of x using the custom colormap
conditional_colors = [conditional_cmap(i / (n - 1)) for i in range(n)]
# Loop over each possible value of x and its probability fx
for x, fx in zip(xs, fxs):
    # Compute the joint density values for each y in the grid, scaled for visualization
    joint_vals = 1.7 * beta.pdf(x=grid, a=x, b=3) * fx
    # Fill the area between the baseline (x) and the curve (x + joint_vals) for ridgeline effect
    ax.fill_between(
        grid, x, x + joint_vals, color=conditional_colors[x - 1], zorder=2, alpha=0.1
    )
    # Plot the top edge of the density curve for each x
    ax.plot(grid, x + joint_vals, color=conditional_colors[x - 1], zorder=2)
# Label the y-axis as "hours studied (x)"
ax.set_ylabel(r"hours studied ($x$)")
# Label the x-axis as "test score (y)"
ax.set_xlabel(r"test score ($y$)")
# Format the x-axis labels as percentages
ax.xaxis.set_major_formatter(PercentFormatter(xmax=1))
# Set the plot title to indicate this is the joint mass/density f(x, y)
ax.set_title(r"joint mass/density $f(x,y)$")
# Adjust layout for better appearance and display the plot
plt.tight_layout()
plt.show()
Dividing the joint mass/density \(f(x,y)\) by the marginal mass \(f(x)\) yields the conditional densities \(f(y|x)\). These are just the same density curves in the ridgeline plot above, normalized so that they integrate to \(1\) over \([0,1]\). They are shown below:
Code
# Create a new matplotlib figure and axis with a specified size
_, ax = plt.subplots(figsize=(6, 4))
# Loop over each possible value of x (hours studied)
for x in xs:
    # Plot the conditional density f(y|x) for each x as a Beta(x, 3) distribution
    ax.plot(
        grid, beta.pdf(x=grid, a=x, b=3), color=conditional_colors[x - 1], label=x
    )
# Add a legend indicating the value of x for each curve
ax.legend(title=r"hours studied ($x$)", loc="center left", bbox_to_anchor=(1, 0.5))
# Format the x-axis labels as percentages (since y is a proportion)
ax.xaxis.set_major_formatter(PercentFormatter(xmax=1))
# Set the plot title to indicate these are conditional densities f(y|x)
ax.set_title(r"conditional densities $f(y|x)$")
# Label the axes
ax.set_xlabel(r"test score ($y$)")
ax.set_ylabel("probability density")
# Adjust layout for better appearance and display the plot
plt.tight_layout()
plt.show()
In our information-theoretic terminology, the conditional density
\[
x\mapsto f(y|x),
\]
thought of as a function of \(x\), models the stochastic flow of information from \(X\) to \(Y\).
The marginal density \(f(y)\) shows moderate uncertainty—it’s somewhat peaked, but not sharply. An exam score randomly drawn from \(Y\) will be mildly uncertain, mildly surprising. The exact amount of uncertainty in \(Y\) will be measured through its entropy, denoted \(H(Y)\), introduced in the next section. In contrast, the conditional densities \(f(y|x)\) exhibit less uncertainty compared to the marginal, especially for values of \(x\) closer to \(6\). The uncertainty remaining in \(Y\) after observing \(X=x\) is measured by the conditional entropy, denoted \(H(Y\mid X=x)\). Averaging this conditional entropy over \(X\) yields the quantity
\[
H(Y\mid X) = E_{x\sim f(x)}\big(H(Y\mid X=x)\big),
\]
the average amount of uncertainty in \(Y\), given \(X\). It is a general observation that
\[
H(Y) \geq H(Y\mid X)
\]
for any pair of random variables \(X\) and \(Y\), reflecting the fact that, on average, observing additional information never increases the uncertainty in \(Y\). Thus, the quantity
\[
I(X,Y) = H(Y) - H(Y\mid X)
\]
is a nonnegative proxy for the amount of information transmitted from \(X\) to \(Y\): if it is large, then the gap between \(H(Y)\) and \(H(Y\mid X)\) is wide, indicating that observations of \(X\) greatly reduce the uncertainty in \(Y\). We interpret this as a “large amount of information” being transmitted from \(X\) to \(Y\). Conversely, when \(I(X,Y)\) is small, observations of \(X\) reveal little about \(Y\); in the extreme case \(I(X,Y)=0\), the two are independent. The quantity \(I(X,Y)\) is exactly the mutual information between \(X\) and \(Y\), introduced in the next section.
Surprisal and entropy
As mentioned in the introduction, entropy measures the uncertainty in the outcome of a random variable. More precisely, it is the average surprisal of an observation. Surprisal varies inversely with probability: large probabilities yield small surprisals, and small probabilities yield large ones.
This inverse relationship is given by the function \(s = -\log{p}\), linking a probability \(p\in [0,1]\) with a surprisal \(s\in [0,\infty)\). The graph of this relationship is shown below:
We write \(\log\) for the base-\(e\) logarithm.
Code
# Create a grid of probability values from 0.01 to 1 (avoiding 0 to prevent log(0))
mesh = np.linspace(0.01, 1, num=100)
# Create a new matplotlib figure and axis with a specified size
_, ax = plt.subplots(figsize=(4, 3))
# Plot the surprisal function s = -log(p) as a function of probability p
ax.plot(mesh, -np.log(mesh), color=yellow)
# Label the x-axis as probability (p)
ax.set_xlabel(r"probability ($p$)")
# Label the y-axis as surprisal (s)
ax.set_ylabel(r"surprisal ($s$)")
# Adjust layout for better appearance and display the plot
plt.tight_layout()
plt.show()
At first glance, many functions might seem equally capable of expressing this inverse relationship between probability and surprisal—so why the choice of base-\(e\) logarithm? It turns out that if one starts from a few natural axioms that any reasonable notion of surprisal should satisfy, then one can prove that all such surprisal functions must be proportional to negative logarithms (see, for example, the discussion in Section 9 in Rioul 2021). The choice of base \(e\) is then somewhat arbitrary, akin to choosing units. Another popular choice is base \(2\), which aligns naturally with bit strings in coding theory. In base \(e\), information content is measured in so-called natural units, or nats; in base \(2\), it is measured in binary units, or bits. (See Section 10 in the aforementioned reference Rioul 2021 for more on units.)
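As a quick illustration of the choice of units (a small sketch, separate from the examples below): an outcome with probability \(1/8\) has surprisal \(\log 8 \approx 2.079\) nats, or exactly \(3\) bits, and the two values differ only by the constant factor \(\log 2\).

import numpy as np

p = 1 / 8  # probability of the outcome

surprisal_nats = -np.log(p)   # natural log: measured in nats
surprisal_bits = -np.log2(p)  # base-2 log: measured in bits

print(f"{surprisal_nats:.3f} nats")              # 2.079 nats
print(f"{surprisal_bits:.3f} bits")              # 3.000 bits
print(f"{surprisal_nats / np.log(2):.3f} bits")  # converting nats to bits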
This link between surprisals and probabilities extends to a link between surprisals and probability densities when the random variables are continuous. Since it is inconvenient to continually distinguish between mass and density functions in all definitions and theorems, we will follow the convention in measure-theoretic probability theory and refer to all probability mass and density functions as densities and denote them all by \(f\). In this scheme, a probability mass function really is a density function relative to the counting measure.
With this convention in mind, the following definition applies to both discrete and continuous random variables:
Definition 1 Let \(X\) and \(Y\) be two random variables with density functions \(f(x)\) and \(f(y)\), respectively.
The surprisal of an observed value \(X=x\) is the quantity \[
s(x) = -\log{f(x)}.
\]
The conditional surprisal of an observed value \(Y=y\), given \(X=x\), is the quantity \[
s(y|x) = -\log{f(y|x)},
\] where \(f(y|x)\) is the conditional density of \(Y\) given \(X\).
For a simple example of the relationship between discrete probabilities and surprisals, let’s bring back our random variable \(X\) from the previous section, which tallied the number of hours a randomly chosen student studied for the upcoming exam:
Code
# Create a figure with two subplots side by side
fig, axes = plt.subplots(ncols=2, figsize=(6, 3))
# Plot the marginal probability mass function f(x) as a bar chart on the first subplot
axes[0].bar(xs, fxs, width=0.4, zorder=2)
axes[0].set_ylabel(r"probability mass")  # Label y-axis
axes[0].set_title(r"marginal density $f(x)$")  # Set subplot title
axes[0].set_xticks(range(1, n + 1))  # Set x-ticks to match possible values of x
# Plot the marginal surprisal s(x) = -log(f(x)) as a bar chart on the second subplot
axes[1].bar(xs, -np.log(fxs), width=0.4, zorder=2)
axes[1].set_ylabel(r"surprisal mass")  # Label y-axis
axes[1].set_title(r"marginal surprisal $s(x)$")  # Set subplot title
axes[1].set_xticks(range(1, n + 1))  # Set x-ticks to match possible values of x
# Add a shared x-axis label for both subplots
fig.supxlabel(r"hours studied ($x$)")
# Adjust layout for better appearance and display the plot
plt.tight_layout()
plt.show()
Because a probability density (mass) function of a discrete random variable must take values in \([0,1]\), its surprisal function is never negative. However, the probability density function of a continuous random variable may take on values larger than \(1\), which means that the associated surprisal density function can be negative. This can be seen for the continuous random variable \(Y\) from the previous section, whose density can exceed \(1\) in regions of high concentration—hence its surprisal can dip below zero.
Code
# Create a figure with two subplots side by side
fig, axes = plt.subplots(ncols=2, figsize=(6, 3))
# Create a grid of y values between 0 and 1 for plotting
grid = np.linspace(0, 1, num=250)
# Plot the marginal probability density function f(y) on the first subplot
axes[0].plot(grid, fy(grid))
axes[0].xaxis.set_major_formatter(PercentFormatter(xmax=1))  # Format x-axis as percentages
axes[0].set_title(r"marginal density $f(y)$")  # Set subplot title
axes[0].set_ylabel("probability density")  # Label y-axis
# Plot the marginal surprisal s(y) = -log(f(y)) on the second subplot
axes[1].plot(grid, -np.log(fy(grid)))
axes[1].xaxis.set_major_formatter(PercentFormatter(xmax=1))  # Format x-axis as percentages
axes[1].set_title(r"marginal surprisal $s(y)$")  # Set subplot title
axes[1].set_ylabel("surprisal density")  # Label y-axis
# Add a shared x-axis label for both subplots
fig.supxlabel("test score ($y$)")
# Adjust layout for better appearance and display the plot
plt.tight_layout()
plt.show()
Having defined surprisal for individual outcomes, entropy emerges naturally as its average—capturing the typical “surprise” we can expect.
Definition 2 Let \(X\) and \(Y\) be two random variables with density functions \(f(x)\) and \(f(y)\), respectively.
The entropy of \(X\) is the quantity \[
H(X) = E_{x\sim f(x)}(s(x)).
\]
The conditional entropy of \(Y\), given an observed value \(X=x\), is the quantity \[
H(Y\mid X=x) = E_{y\sim f(y|x)}(s(y\mid x)),
\] where \(f(y|x)\) is the conditional density of \(Y\) given \(X\).
The conditional entropy of \(Y\), given \(X\), is the quantity
\[
H(Y\mid X) = E_{x\sim f(x)}(H(Y\mid X=x)).
\]
In the case that \(X\) is discrete, the entropy \(H(X)\) is a sum of either a finite or countably infinite number of terms:
\[
H(X) = -\sum_x f(x)\log{f(x)}.
\]
In the case that \(X\) is continuous, the entropy is instead given by the integral
\[
H(X) = -\int f(x)\log{f(x)} \ dx,
\]
where, by convention, \(\int\) denotes integration over \(\mathbb{R}\). In the literature, the entropy of a continuous random variable is often called differential entropy.
The stats submodule of the SciPy library contains a convenient method called entropy for computing entropies of discrete random variables. We use it to compute the entropy \(H(X)\), where \(X\) is the “hours studied” random variable:
from scipy.stats import entropy

# The values of the probability mass function of X are stored in a
# NumPy array called `fxs`
print("The probability mass function f(x) of X is:\n")
for x, p in enumerate(fxs):
    # Print the probability mass for each possible value of X (1 through n)
    print(f"  f({x + 1}) =", round(p, 3))

# Compute and print the entropy H(X) using scipy's entropy function
print(f"\nThe entropy H(X) is {entropy(fxs):.3f}.")
The probability mass function f(x) of X is:
f(1) = 0.163
f(2) = 0.244
f(3) = 0.244
f(4) = 0.183
f(5) = 0.11
f(6) = 0.055
The entropy H(X) is 1.698.
We can use the quad method in the integrate submodule of SciPy to compute differential entropies. For the “exam score” random variable \(Y\), we compute:
from scipy.integrate import quad

# The marginal probability density function of Y is implemented as
# the function `fy`

# Compute the differential entropy H(Y) by integrating -f(y) * log(f(y)) over [0, 1]
diff_entropy, _ = quad(func=lambda y: -fy(y) * np.log(fy(y)), a=0, b=1)

# Print the computed differential entropy value
print(f"The differential entropy H(Y) is {diff_entropy:.3f}.")
The differential entropy H(Y) is -0.131.
Notice that \(H(Y)\) turns out to be negative—a reminder that differential entropy behaves quite differently from its discrete cousin.
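For instance (a small illustrative sketch, separate from the exam-score example), a uniform distribution on \([0, a]\) has differential entropy \(\log a\), which is negative whenever \(a < 1\). SciPy's frozen distributions expose differential entropy through their entropy method, so we can check this directly:

from scipy.stats import uniform

# Differential entropy of Uniform(0, a) is log(a):
print(uniform(loc=0, scale=2.0).entropy())  # log(2), positive
print(uniform(loc=0, scale=0.5).entropy())  # log(1/2), negative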
Kullback–Leibler divergence and mutual information
In this section, we develop an information-theoretic way to measure how “far apart” two probability distributions are. By way of motivation, we consider two probability measures on a single finite probability space \(\Omega\), so that the two measures have mass functions \(f(\omega)\) and \(g(\omega)\). The metric we’ll use is the mean logarithmic relative magnitude, a measure that captures not the absolute difference between probabilities, but how one probability scales relative to another. To define it, we first define the absolute relative magnitude of the probability \(f(\omega)\) to the probability \(g(\omega)\) as the ratio \(f(\omega)/g(\omega)\). Then, the logarithmic relative magnitude refers to the base-\(e\) logarithm of the absolute relative magnitude:
\[
\log\left(\frac{f(\omega)}{g(\omega)}\right).
\]
Intuitively, this logarithm tells us the “order of magnitude” difference between \(f(\omega)\) and \(g(\omega)\). If \(f(\omega)\approx e^k\) and \(g(\omega)\approx e^l\), then the log ratio is roughly \(k-l\).
Perhaps the most obvious immediate benefit of introducing the logarithm is that it yields a workable number when \(f(\omega)\) and \(g(\omega)\) each have widely different magnitudes. For example, let’s suppose that the mass functions \(f(\omega)\) and \(g(\omega)\) are given by
\[
f(\omega) = \binom{10}{\omega}(0.4)^{\omega}(0.6)^{10-\omega} \quad \text{and} \quad g(\omega) = \binom{10}{\omega}(0.9)^{\omega}(0.1)^{10-\omega},
\]
for \(\omega\in \{0,1,\ldots,10\}\). These are the mass functions of a \(\mathcal{B}in(10,0.4)\) and \(\mathcal{B}in(10,0.9)\) random variable, respectively. We then plot histograms for these mass functions, along with histograms of the absolute and logarithmic relative magnitudes:
Code
# Define the possible values of omega (0 through 10)
omegas = np.arange(0, 11)
# Compute the probability mass functions for two Binomial distributions:
#   p: Binomial(n=10, p=0.4)
#   q: Binomial(n=10, p=0.9)
p = binom(n=10, p=0.4).pmf(omegas)
q = binom(n=10, p=0.9).pmf(omegas)
# Titles for each subplot
titles = [
    "$f(\\omega)$",  # PMF of first distribution
    "$g(\\omega)$",  # PMF of second distribution
    "$\\frac{f(\\omega)}{g(\\omega)}$",  # Ratio of PMFs
    "$\\log\\left(\\frac{f(\\omega)}{g(\\omega)}\\right)$",  # Log-ratio of PMFs
]
# Data to plot in each subplot
probs = [p, q, p / q, np.log(p / q)]
# Y-axis limits for each subplot for better visualization
ylims = [(0, 0.4), (0, 0.4), (-50, 0.75e8), (-10, 20)]
# Create a 2x2 grid of subplots
fig, axes = plt.subplots(ncols=2, nrows=2, figsize=(8, 5), sharex=True)
# Loop over each subplot, plotting the corresponding data
for title, prob, ylim, axis in zip(titles, probs, ylims, axes.flatten()):
    axis.bar(omegas, prob, width=0.4, zorder=2)  # Bar plot for each omega
    axis.set_xticks(ticks=omegas)  # Set x-ticks to omega values
    axis.set_ylim(ylim)  # Set y-axis limits
    axis.set_title(title)  # Set subplot title
# Add a shared x-axis label for all subplots
fig.supxlabel("$\\omega$")
# Adjust layout for better appearance and spacing
plt.tight_layout()
plt.subplots_adjust(hspace=0.5)
plt.show()
The second row makes the point vividly: the absolute relative magnitudes span such wildly different scales that the plot is almost useless, and numerical computations would be unstable. The logarithmic version, by contrast, stays well-behaved and informative.
We obtain a single-number summary of the logarithmic relative magnitudes by taking their mean with respect to the mass function \(f(\omega)\), giving us the number
\[
\sum_\omega f(\omega)\log\left(\frac{f(\omega)}{g(\omega)}\right).
\tag{3}\]
Had we instead taken the mean with respect to the mass function \(g(\omega)\), we would have obtained the number
\[
\sum_\omega g(\omega)\log\left(\frac{f(\omega)}{g(\omega)}\right) = -\sum_\omega g(\omega)\log\left(\frac{g(\omega)}{f(\omega)}\right),
\tag{4}\]
where the right-hand side is the negative of a number of the form (3) (with the roles of \(f\) and \(g\) interchanged). So, at least up to sign, it doesn’t really matter which of the two numbers (3) or (4) we use to develop our theory. Our choice of (3) has the benefit of making the KL divergence nonnegative when the distributions are discrete.
These considerations lead us to:
Definition 3 Let \(X\) and \(Y\) be two random variables with density functions \(f(x)\) and \(g(x)\), respectively. The Kullback–Leibler divergence (or just KL divergence) from \(X\) to \(Y\), denoted \(D(X \parallel Y)\), is the mean logarithmic relative magnitude:
\[
D(X \parallel Y) = E_{x\sim f(x)}\left(\log\frac{f(x)}{g(x)}\right).
\]
Technically, \(D(X\parallel Y)\) is defined only when \(g(x)=0\) implies \(f(x)=0\) for all \(x\) (a condition known as absolute continuity in measure theory). If \(X\) and \(Y\) are continuous, then \(D(X\parallel Y)\) is often called the differential KL divergence.
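As a quick numerical illustration of this support condition (a hedged sketch; scipy.special.rel_entr computes the individual terms \(f(x)\log(f(x)/g(x))\) elementwise), the divergence blows up to infinity as soon as the first distribution puts positive mass where the second puts none:

import numpy as np
from scipy.special import rel_entr

f = np.array([0.5, 0.5, 0.0])    # mass function of X
g = np.array([0.5, 0.25, 0.25])  # mass function of Y

# Here g(x) = 0 never occurs, so the condition holds and D(X || Y) is finite:
print(rel_entr(f, g).sum())  # about 0.347

# Reversing the roles breaks the condition: f assigns zero mass to the
# third outcome while g does not, so D(Y || X) is infinite.
print(rel_entr(g, f).sum())  # inf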
For some examples of differential KL divergences, let’s consider the conditional random variables \(Y\mid X=x\) from the previous section, which give the exam score \(Y\) of a randomly chosen student if they had studied \(X=x\) hours (for \(x=1,2,\ldots,6\)). In the figure below, we plot the densities \(f(y\mid x)\) of the conditional distributions and compute the five differential KL divergences
\[
D\big( (Y\mid X=1) \parallel (Y\mid X=x) \big), \qquad x = 2,3,4,5,6.
\]
# Define the conditional density f(y|x) as the Beta(x, 3) PDF
def cond(y, x):
    return beta.pdf(x=y, a=x, b=3)

# Define the integrand for KL divergence between two Beta distributions:
#   cond(y, x1) * log(cond(y, x1) / cond(y, x2))
def integrand(y, x1, x2):
    return cond(y, x1) * np.log(cond(y, x1) / cond(y, x2))

# Compute KL divergence D((Y|X=1) || (Y|X=x+2)) for x = 0,1,2,3,4 (i.e., x2 = 2,3,4,5,6)
KL_div = np.empty(5)
for x in range(5):
    KL_div[x], _ = quad(func=integrand, args=(1, x + 2), a=0, b=1)

# Set up a 2x6 grid for custom subplot arrangement
fig = plt.figure(figsize=(8, 5))
gs = gridspec.GridSpec(2, 6, figure=fig)
ax1 = fig.add_subplot(gs[0, 0:2])
ax2 = fig.add_subplot(gs[0, 2:4])
ax3 = fig.add_subplot(gs[0, 4:6])
ax4 = fig.add_subplot(gs[1, 1:3])
ax5 = fig.add_subplot(gs[1, 3:5])
axes = [ax1, ax2, ax3, ax4, ax5]

# For each subplot, plot the two conditional densities and annotate with KL divergence
for x, ax in enumerate(axes):
    # Plot f(y|x=1) in blue
    ax.plot(grid, cond(grid, 1), color=blue, zorder=2, label="x = 1")
    ax.fill_between(grid, cond(grid, 1), zorder=2, color=blue, alpha=0.1)
    # Plot f(y|x=x+2) in yellow
    ax.plot(grid, cond(grid, x + 2), color=yellow, zorder=2, label=f"x = {x + 2}")
    ax.fill_between(grid, cond(grid, x + 2), zorder=2, color=yellow, alpha=0.1)
    # Annotate with the computed KL divergence
    ax.set_title(f"KL div. = {KL_div[x]:.3f}")
    ax.set_ylim(0, 4)
    ax.legend(loc="upper right")

# Adjust layout and spacing for better appearance
plt.tight_layout()
plt.subplots_adjust(hspace=0.5)
plt.show()
Each subplot contains the blue density curve \(f(y\mid x=1)\), along with a yellow density curve \(f(y\mid x)\) for \(x=2,3,4,5,6\). As \(x\) gets larger, we can see visually that the densities become less alike; this increasing dissimilarity is reflected in larger KL divergences.
At the other end of the spectrum, we have \(D(X \parallel Y) = 0\) when \(X\) and \(Y\) are identically distributed. And, at least when the variables are discrete, it is a basic but important fact that we always have \(D(X \parallel Y)\geq 0\), with equality if and only if \(X\) and \(Y\) are identically distributed; this is referred to as Gibbs’ inequality (see here for a proof). So, the KL divergence has several properties that make it a good measure for the “distance” between two probability distributions. However, note that this distance is not symmetric, in the sense that we have
\[
D(X\parallel Y) \neq D(Y \parallel X)
\]
in general.
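To see this asymmetry numerically, we can reuse the two binomial mass functions from the relative-magnitude example above (a small sketch; when scipy.stats.entropy is given two mass functions, it returns the discrete KL divergence between them):

import numpy as np
from scipy.stats import binom, entropy

omegas = np.arange(0, 11)
f = binom(n=10, p=0.4).pmf(omegas)  # mass function f(omega)
g = binom(n=10, p=0.9).pmf(omegas)  # mass function g(omega)

# entropy(pk, qk) returns sum(pk * log(pk / qk)), i.e., the KL divergence
print(f"D(f || g) = {entropy(f, g):.3f}")
print(f"D(g || f) = {entropy(g, f):.3f}")
# The two values differ, so the KL divergence is not symmetric.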
KL divergence measures how one distribution differs from another. To study relationships between random variables, we apply it to their joint and marginal distributions. If \(X\) and \(Y\) are independent, their joint density factors as the product of their marginals, \(f(x,y)=f(x)f(y)\). Thus, a measure of the “information flow” between \(X\) and \(Y\) is the distance—in the sense of KL divergence—from the true joint density \(f(x,y)\) to the product densities \(f(x)f(y)\). This leads us to:
Definition 4 Let \(X\) and \(Y\) be two random variables with density functions \(f(x)\) and \(f(y)\). The mutual information shared between \(X\) and \(Y\), denoted \(I(X,Y)\), is the quantity
\[
I(X,Y) = E_{(x,y)\sim f(x,y)}\left(\log\frac{f(x,y)}{f(x)f(y)}\right).
\tag{5}\]
The product \(f(x)f(y)\) is the density of some probability distribution on \(\mathbb{R}^2\), which would coincide with the true joint probability distribution (with density \(f(x,y)\)) if the variables were independent. So, the mutual information is the KL divergence between two probability distributions on \(\mathbb{R}^2\), and we have \(I(X,Y)=0\) when \(X\) and \(Y\) are independent.
As an example, we return once more to our random variables \(X\) and \(Y\), the “hours studied” discrete variable and the “exam score” continuous variable. In this case, the joint distribution of \(X\) and \(Y\) is a mixed discrete-continuous one, so the formula (5) gives
\[
I(X,Y) = \sum_{x=1}^{6} \int_0^1 f(x,y)\log\left(\frac{f(x,y)}{f(x)f(y)}\right) \, dy,
\]
where \(f(x,y)\) is the true joint mass-density function and \(f(x)\) and \(f(y)\) are the marginal mass and density functions, respectively. We implement this formula directly in Python, using the quad method in the integrate submodule of SciPy for integration:
# The conditional density of Y|X=x is implemented as the
# function `cond(y, x)`, the marginal density of Y is
# the function `fy(y)`, and the marginal probability masses
# of X are contained in a NumPy array called `fxs`.

# Compute the true joint density f(x, y) = P(X=x) * f(y|x)
def fxy(x, y):
    return fxs[x - 1] * cond(y, x)

# Compute the product of marginals: P(X=x) * f(y)
def prod(x, y):
    return fxs[x - 1] * fy(y)

# Define the integrand for mutual information:
#   f(x, y) * log(f(x, y) / (P(X=x) * f(y)))
def integrand(x, y):
    return fxy(x, y) * np.log(fxy(x, y) / prod(x, y))

# For each x, create a function of y for integration over y in [0, 1]
funcs = [lambda y, x=x: integrand(x, y) for x in range(1, 7)]

# Compute the mutual information by summing the integrals over y for each x
mutual_info = sum([quad(func, a=0, b=1)[0] for func in funcs])

# Print the computed mutual information I(X, Y)
print(f"The mutual information I(X,Y) is {mutual_info:.3f}.")
The mutual information I(X,Y) is 0.201.
Using the definitions of marginal and conditional entropies given in Definition 2, one easily proves that the mutual information \(I(X,Y)\) may be computed as described in the following theorem:
Theorem 1 (Mutual information is entropy) Let \(X\) and \(Y\) be two random variables. Then
\[
I(X,Y) = H(Y) - H(Y\mid X).
\tag{6}\]
Thus, the mutual information measures the amount of entropy in \(Y\) that is removed by observing \(X\); what is “left over” after the observation is the conditional entropy \(H(Y\mid X)\). In other words, the mutual information quantifies how much knowing \(X\) reduces uncertainty about \(Y\).
We end this section by using formula (6) to re-do our computation of the mutual information \(I(X,Y)\) from above. We get the same answer:
# The differential entropy H(Y) is stored in `diff_entropy`

# For each x, define a function of y for the conditional entropy integrand: -f(y|x) * log(f(y|x))
funcs = [lambda y, x=x: -cond(y, x) * np.log(cond(y, x)) for x in range(1, 7)]

# Compute the conditional entropy H(Y|X=x) for each x by integrating over y in [0, 1]
cond_entropies = [quad(func, a=0, b=1)[0] for func in funcs]

# Compute H(Y) - sum_x P(X=x) * H(Y|X=x), which equals the mutual information I(X, Y)
mutual_info = diff_entropy - sum(
    [cond_entropy * fx for cond_entropy, fx in zip(cond_entropies, fxs)]
)

# Print the recomputed mutual information
print(f"The mutual information I(X,Y) is {mutual_info:.3f}.")
The mutual information I(X,Y) is 0.201.
Mutual information of jointly normal random variables
Useful intuition for the mutual information \(I(X,Y)\) arises from the simple case of jointly normal variables, where a closed-form expression can be obtained. As a first step toward this formula, we compute the differential entropy of a single normal random variable:
For background on normal random vectors, see here.
Theorem 2 (Entropy of a normal random variable) If \(X\sim \mathcal{N}(\mu,\sigma^2)\), then
\[
H(X) = \frac{1}{2}\log(2\pi e \sigma^2),
\]
where \(e\) is the base of the natural logarithm.
Proof.
Letting \(f(x)\) be the density of \(X\), we compute:
\[
\begin{aligned}
H(X) &= -\int f(x)\log f(x) \ dx \\
&= -\int f(x)\left( -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x-\mu)^2}{2\sigma^2} \right) dx \\
&= \frac{1}{2}\log(2\pi\sigma^2)\int f(x) \ dx + \frac{1}{2\sigma^2}\int f(x)(x-\mu)^2 \ dx \\
&= \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2} = \frac{1}{2}\log(2\pi e \sigma^2),
\end{aligned}
\]
where we’ve used \(\int f(x) \ dx =1\) and \(\int f(x)(x-\mu)^2 \ dx = \sigma^2\).
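As a quick numerical sanity check on Theorem 2 (a minimal sketch), we can compare the closed-form expression against a direct numerical integration of \(-\int f(x)\log f(x) \ dx\) for hypothetical values of \(\mu\) and \(\sigma\):

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 1.0, 2.0
f = norm(loc=mu, scale=sigma).pdf

# Closed-form entropy from Theorem 2
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Numerical integration of -f(x) log f(x) over (effectively) the whole real line
numerical, _ = quad(lambda x: -f(x) * np.log(f(x)), mu - 10 * sigma, mu + 10 * sigma)

print(f"closed form: {closed_form:.6f}")
print(f"numerical:   {numerical:.6f}")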
It is well known that the conditional distributions of a normal random vector are themselves normal; we include here the result for a \(2\)-dimensional normal random vector. The proof follows directly from standard properties of multivariate normal distributions and is omitted for brevity.
Theorem 3 (Conditional distributions of normal vectors are normal) Let \((X,Y) \sim \mathcal{N}_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) be a \(2\)-dimensional normal random vector with
\[
\boldsymbol{\mu} = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix},
\]
where \(\rho\) is the correlation of \(X\) and \(Y\). Then
\[
Y \mid X=x \sim \mathcal{N}\left( \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X), \ \sigma_Y^2(1-\rho^2) \right).
\]
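Although we omit the proof, the formula is easy to check numerically (a small sketch with hypothetical parameter values): the conditional density from Theorem 3 should agree with the ratio \(f(x,y)/f(x)\) of the joint normal density to the marginal density of \(X\).

import numpy as np
from scipy.stats import norm, multivariate_normal

# Hypothetical parameters for the 2-dimensional normal vector (X, Y)
muX, muY, sigmaX, sigmaY, rho = 0.0, 0.0, 1.0, 2.0, 0.7
Sigma = np.array([[sigmaX**2, rho * sigmaX * sigmaY],
                  [rho * sigmaX * sigmaY, sigmaY**2]])
joint = multivariate_normal(mean=[muX, muY], cov=Sigma)

x, ys = 1.0, np.linspace(-3, 3, 5)

# Conditional density of Y given X = x, per Theorem 3
mu_cond = muY + rho * (sigmaY / sigmaX) * (x - muX)
sigma_cond = sigmaY * np.sqrt(1 - rho**2)
via_theorem = norm(loc=mu_cond, scale=sigma_cond).pdf(ys)

# Same conditional density obtained as f(x, y) / f(x)
points = np.column_stack([np.full_like(ys, x), ys])
via_ratio = joint.pdf(points) / norm(loc=muX, scale=sigmaX).pdf(x)

print(np.allclose(via_theorem, via_ratio))  # True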
The next result contains the formula for the mutual information of two jointly normal random variables; its proof is an easy application of our previous results. Notice the mutual information only depends on the correlation between the variables, as intuition might suggest.
Theorem 4 (Mutual information of jointly normal variables) Let \((X,Y) \sim \mathcal{N}_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) be a \(2\)-dimensional normal random vector. Then
\[
I(X,Y) = -\frac{1}{2}\log(1-\rho^2),
\]
where \(\rho\) is the correlation of \(X\) and \(Y\).
Hence, the larger the correlation between the marginal normals, the larger the mutual information. To see this in action with a concrete example, let’s suppose that we have \((X,Y) \sim \mathcal{N}_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), where
\[
\boldsymbol{\mu} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{bmatrix} 1 & 2\rho \\ 2\rho & 4 \end{bmatrix},
\]
and \(\rho\) is the correlation of \(X\) and \(Y\) (hence the marginal standard deviations are \(\sigma_X = 1\) and \(\sigma_Y=2\)). In the second and third plots below, we have selected two correlations \(\rho=0.5, 0.85\) and computed the corresponding mutual information \(I(X,Y)\). The isoprobability contours of the joint normal density \(f(x,y)\) are shown in yellow, while the conditional normal densities \(f(y|x)\) are shown in blue for each of \(x=-1, 0, 1\). For comparison, the marginal density \(f(y)\) is shown in the first plot.
Code
# Define a function to plot contours of a bivariate normal distribution
def plot_multivar_norm(ax, muX, muY, sigmaX, sigmaY, x, y, labels=False):
    # Construct the covariance matrix using the specified correlation rho
    Sigma = np.array(
        [[sigmaX**2, rho * sigmaX * sigmaY], [rho * sigmaX * sigmaY, sigmaY**2]]
    )
    Mu = np.array([muX, muY])  # Mean vector
    U = multivariate_normal(mean=Mu, cov=Sigma)  # Multivariate normal object
    grid = np.dstack((x, y))  # Create a grid for evaluation
    z = U.pdf(grid)  # Evaluate the PDF on the grid
    contour = ax.contour(x, y, z, colors=yellow, alpha=0.3)  # Plot contours
    if labels:
        ax.clabel(contour, inline=True, fontsize=8)  # Optionally label contours

# Define a function to plot the conditional density f(y|x) for a given x_obs
def plot_conditional(
    ax, muX, muY, sigmaX, sigmaY, rho, y_mesh, x_obs, magnification_factor=1
):
    # Compute conditional mean and standard deviation for Y|X=x_obs
    mu = muY + (x_obs - muX) * rho * sigmaY / sigmaX
    sigma = sigmaY * np.sqrt(1 - rho**2)
    # Compute and scale the conditional normal density
    x = magnification_factor * norm(loc=mu, scale=sigma).pdf(y_mesh)
    # Plot the conditional density horizontally, shifted to align with x_obs
    ax.plot(-x + x_obs, y_mesh, color=blue)
    ax.fill_betweenx(y_mesh, -x + x_obs, x_obs, color=blue, alpha=0.4)

# Set parameters for the bivariate normal distribution
muX = 0
muY = 0
sigmaX = 1
sigmaY = 2
rhos = [0.5, 0.85]  # Correlation values to illustrate
x_obs = [-1, 0, 1]  # Observed x values for conditional plots
x, y = np.mgrid[-2.1:2.1:0.01, -5:5:0.01]  # Grid for contour plot
y_mesh = np.linspace(-5, 5, num=250)  # Grid for conditional densities

# Create a figure with three subplots: one for the marginal, two for different correlations
fig, axes = plt.subplots(
    ncols=3, figsize=(8, 4), sharey=True, gridspec_kw={"width_ratios": [1, 4, 4]}
)

# Plot the marginal density of Y on the first subplot (as a horizontal density)
magnification_factor = 2.5
x_marginal = magnification_factor * norm(scale=sigmaY).pdf(y_mesh)
axes[0].plot(-x_marginal, y_mesh, color=blue)
axes[0].set_xlim(-1, 0)
axes[0].fill_betweenx(y_mesh, -x_marginal, 0, color=blue, alpha=0.4)
axes[0].yaxis.tick_right()
axes[0].spines["left"].set_visible(False)
axes[0].spines["right"].set_visible(True)
axes[0].spines["bottom"].set_visible(False)
axes[0].set_xticks([])

# For each correlation value, plot the joint contours and conditional densities
for rho, ax in zip(rhos, axes[1:]):
    plot_multivar_norm(ax, muX, muY, sigmaX, sigmaY, x, y)
    for x_ob in x_obs:
        plot_conditional(
            ax, muX, muY, sigmaX, sigmaY, rho, y_mesh, x_obs=x_ob, magnification_factor=3
        )
    # Compute and display the mutual information for this correlation
    info = -(1 / 2) * np.log(1 - rho**2)
    ax.set_title(rf"$\rho = {rho}$, $I(X,Y) = {info:0.3f}$")
    ax.set_xlabel(r"$x$")
    ax.set_xlim(-2.2, 2.2)
    ax.set_xticks(range(-2, 3))

# Label the y-axis on the first subplot
axes[0].set_ylabel(r"$y$")
plt.tight_layout()
plt.show()
Compared to the marginal distribution of \(Y\), the conditional distributions become increasingly concentrated as \(\rho\) increases. This stronger concentration reflects the reduced uncertainty in \(Y\) when \(X\) is observed. This example illustrates the intuition behind mutual information: greater correlation implies stronger dependence, smaller conditional entropy, and thus higher mutual information.
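As a final check (a short sketch), Theorem 4 also follows by combining Theorems 1, 2, and 3: since \(Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)\) and \(Y\mid X=x \sim \mathcal{N}(\cdot, \sigma_Y^2(1-\rho^2))\), the entropy difference \(H(Y) - H(Y\mid X)\) collapses to \(-\frac{1}{2}\log(1-\rho^2)\), independent of \(\sigma_Y\). The arithmetic for \(\rho = 0.85\):

import numpy as np

sigmaY, rho = 2.0, 0.85

# Entropies from Theorem 2: H(Y) and H(Y | X = x) (the latter does not depend on x)
H_Y = 0.5 * np.log(2 * np.pi * np.e * sigmaY**2)
H_Y_given_X = 0.5 * np.log(2 * np.pi * np.e * sigmaY**2 * (1 - rho**2))

# Mutual information via Theorem 1, compared with the closed form of Theorem 4
print(f"H(Y) - H(Y|X)        = {H_Y - H_Y_given_X:.3f}")
print(f"-(1/2)log(1 - rho^2) = {-0.5 * np.log(1 - rho**2):.3f}")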
Bibliography
Jaynes, E. T. 1957. “Information Theory and Statistical Mechanics.” The Physical Review 106 (4): 620–30.
Rioul, O. 2021. “This Is IT: A Primer on Shannon’s Entropy and Information.” In Information Theory: Poincaré Seminar 2018, 49–86. Springer.
Shannon, C. E. 1948. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27 (3): 379–423.