Mutual Information

Definition. Given two random variables $X,Y$ and their joint distribution $P_ {XY},$ the mutual information between $X$ and $Y$ is defined as

$$I(X;Y):=D_{\mathrm{KL}}(P_{XY}\Vert P_XP_Y),$$

which is the KL divergence of $P_{XY}$ with respect to the product distribution $P_XP_Y,$ i.e., the hypothetical joint distribution under which $X$ and $Y$ are independent.
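
As a quick illustration of this definition (not part of the argument below), the following Python sketch computes $I(X;Y)$ for a small discrete joint distribution directly as $D_{\mathrm{KL}}(P_{XY}\Vert P_XP_Y)$ in nats; the joint table is made up purely for illustration.

```python
import numpy as np

# A small joint distribution P_XY on a 2 x 3 alphabet (made-up numbers).
P_XY = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.15, 0.15]])

P_X = P_XY.sum(axis=1, keepdims=True)  # marginal of X (column vector)
P_Y = P_XY.sum(axis=0, keepdims=True)  # marginal of Y (row vector)

# I(X;Y) = D_KL(P_XY || P_X P_Y), computed in nats.
I_XY = np.sum(P_XY * np.log(P_XY / (P_X * P_Y)))
print(I_XY)  # nonnegative; equals 0 iff X and Y are independent
```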

Properties of mutual information. The following statements are true:

  • $I(X;Y) = D_ {\mathrm{KL}}(P_ {Y\vert X}\Vert P_ Y\vert P_ X) = \mathbb{E}_ {X\sim P_ X}[D_{\mathrm{KL}}(P_{Y\vert X}\Vert P_Y)];$
  • (Symmetry) $I(X;Y) = I(Y;X);$
  • (Measure of dependency) $I(X;Y)\geq 0.$ The equality holds if and only if $X\perp Y.$
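
The first identity and the symmetry property can be checked numerically on the same made-up joint table; the sketch below computes $\mathbb{E}_ {X\sim P_ X}[D_{\mathrm{KL}}(P_{Y\vert X}\Vert P_Y)]$ and $I(Y;X)$ and verifies that both coincide with $I(X;Y).$

```python
import numpy as np

P_XY = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.15, 0.15]])
P_X = P_XY.sum(axis=1)
P_Y = P_XY.sum(axis=0)

# I(X;Y) as D_KL(P_XY || P_X P_Y).
I_joint = np.sum(P_XY * np.log(P_XY / np.outer(P_X, P_Y)))

# I(X;Y) as E_{X~P_X}[ D_KL(P_{Y|X} || P_Y) ]; each row of P_Y_given_X is P_{Y|X=x}.
P_Y_given_X = P_XY / P_X[:, None]
kl_rows = np.sum(P_Y_given_X * np.log(P_Y_given_X / P_Y[None, :]), axis=1)
I_conditional = np.sum(P_X * kl_rows)

# Symmetry: I(Y;X) computed from the transposed table.
I_swapped = np.sum(P_XY.T * np.log(P_XY.T / np.outer(P_Y, P_X)))

print(np.allclose(I_joint, I_conditional), np.allclose(I_joint, I_swapped))
```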

Akin to the corresponding property of $f$-divergences, mutual information satisfies a data processing inequality along Markov chains:

Theorem 1. Let $X\to Y\to Z$ form a Markov chain, i.e., $Z\perp X\ \vert\ Y.$ Then

$$I(X;Z)\leq I(X;Y).$$

Proof. By the Markov property, for each realization $x,$ the conditional distribution $P_ {Z\vert X=x}$ is obtained by passing $P_ {Y\vert X=x}$ through the channel $P_ {Z\vert Y};$ the same channel maps $P_ Y$ to $P_ Z.$ It then follows from the data processing inequality for the Kullback-Leibler divergence that

$$D_{\mathrm{KL}}(P_{Z\vert X=x}\Vert P_Z) \leq D_{\mathrm{KL}}(P_{Y\vert X=x}\Vert P_Y).$$

Taking the expectation with respect to $X\sim P_X$ on both sides of this inequality yields the result.
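
Theorem 1 can also be probed numerically: draw a random source $P_X$ and two random channels $P_{Y\vert X}$ and $P_{Z\vert Y},$ form the joint distributions of $(X,Y)$ and $(X,Z),$ and compare the two mutual informations. The sketch below does this with randomly generated finite-alphabet distributions, purely as an illustration.

```python
import numpy as np

def mutual_information(P_joint):
    """Mutual information (in nats) of a joint probability table."""
    P_row = P_joint.sum(axis=1, keepdims=True)
    P_col = P_joint.sum(axis=0, keepdims=True)
    mask = P_joint > 0
    return np.sum(P_joint[mask] * np.log(P_joint[mask] / (P_row * P_col)[mask]))

rng = np.random.default_rng(0)

# A randomly generated Markov chain X -> Y -> Z (illustration only):
# a source P_X and two row-stochastic channel matrices.
P_X = rng.dirichlet(np.ones(4))
K_YX = rng.dirichlet(np.ones(5), size=4)  # K_YX[x, y] = P(Y = y | X = x)
K_ZY = rng.dirichlet(np.ones(3), size=5)  # K_ZY[y, z] = P(Z = z | Y = y)

P_XY = P_X[:, None] * K_YX  # joint distribution of (X, Y)
P_XZ = P_XY @ K_ZY          # joint of (X, Z), valid since Z is independent of X given Y

# Data processing inequality: I(X;Z) <= I(X;Y).
print(mutual_information(P_XZ), "<=", mutual_information(P_XY))
```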

Fano’s Inequality

Let $\mathcal{P}$ be a set of distributions on a sample space $\mathcal{X},$ and let $\mathbf{X}=\lbrace X_ 1,\cdots,X_ n\rbrace$ be i.i.d. from some unknown $P\in\mathcal{P}.$ We are interested in recovering $P$ from our observations $\mathbf{X}.$ In a parametric model, $P=P_ \theta$ is characterized by some parameter $\theta\in\Theta,$ where $\Theta$ is the parameter space. For example, in a univariate Gaussian model, $\theta = (\mu,\sigma^2)\in\mathbb{R}\times\mathbb{R}_ {+}.$ In a nonparametric model, we may instead estimate some functional $\theta=\theta(P)$ of the distribution $P,$ e.g. the mean, variance, median, or quantiles.

Now we consider a problem of multiple hypothesis testing. Given a finite subset $\mathcal{S}=\lbrace P_1,\cdots,P_m\rbrace\subset\mathcal{P},$ we want to determine from which of these distributions our observations $\mathbf{X}$ were drawn. Typically, testing is an easier problem than estimation or regression, since we only need to distinguish the target distribution from finitely many alternatives.

We take a Bayesian perspective and assume that the “true” distribution is chosen uniformly at random from $P_1,\cdots,P_m.$ Formally, we introduce a latent random variable $J\sim\mathrm{Unif}(\lbrace 1,\cdots,m\rbrace).$ Conditional on a realization $J=j,$ the observations $\mathbf{X}=\lbrace X_1,\cdots,X_n\rbrace$ are drawn i.i.d. from $P_j,$ i.e., $P_{\mathbf{X}\vert J=j}=P_j^{\otimes n}.$ Based on the observations $\mathbf{X},$ we construct a test $\psi:\mathcal{X}^n\to\lbrace 1,\cdots,m\rbrace.$ Hence $\psi=\psi(\mathbf{X})$ is an estimate of $J$ (or of its realization $j$).

A motivating example. Let $\theta:\mathcal{P}\to\Theta$ be some functional of the distributions in $\mathcal{P},$ where $(\Theta,d)$ is a metric space. Based on the observations $\mathbf{X}=\lbrace X_1,\cdots,X_n\rbrace,$ we construct an estimator $\hat{\theta}(\mathbf{X}),$ where $\hat{\theta}:\mathcal{X}^n\to\Theta$ maps the sample space to the parameter space. A natural test for the multiple hypothesis testing problem is then

$$\psi(\mathbf{X}) = \underset{j\in\lbrace 1,\cdots,m\rbrace}{\mathrm{argmin}}\ d(\hat{\theta}(\mathbf{X}),\theta_j),$$

where $\theta_j = \theta(P_j).$ In a nutshell, we choose the hypothesis whose parameter is the nearest neighbor of our estimate.
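
As a toy instance of this construction (all numbers below are made up), take $\mathcal{S}$ to be a few Bernoulli distributions, let $\theta(P)$ be the mean, let $\hat{\theta}$ be the sample mean, and take $d$ to be the absolute difference. The following sketch simulates the resulting nearest-neighbor test under the uniform prior on $J$ and reports its average error rate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical instance: P_j = Bernoulli(theta_j), theta(P) = mean,
# theta_hat = sample mean, d = absolute difference.  All numbers are made up.
thetas = np.array([0.40, 0.45, 0.50, 0.55])  # theta_1, ..., theta_m
m, n, trials = len(thetas), 5, 20_000

errors = 0
for _ in range(trials):
    j = rng.integers(m)                          # J ~ Unif({1,...,m}) (0-based here)
    X = rng.random(n) < thetas[j]                # X_1,...,X_n i.i.d. from P_j
    theta_hat = X.mean()                         # plug-in estimator of the mean
    psi = np.argmin(np.abs(theta_hat - thetas))  # nearest-neighbor test
    errors += (psi != j)

# Empirical estimate of (1/m) * sum_j P_j(psi != j).
print("empirical average error:", errors / trials)
```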

Theorem 2 (Fano’s inequality). For any map $\psi:\mathcal{X}^n\to\lbrace 1,\cdots,m\rbrace,$ we have

$$\frac{1}{m}\sum_{j=1}^m P_j(\psi\neq j) \geq 1 - \frac{n\beta + \log 2}{\log m},$$

where $\beta = \sup_{j\neq k}D_{\mathrm{KL}}(P_j\Vert P_k).$

Proof. Since $J\to\mathbf{X}\to\psi$ forms a Markov chain, we have

$$\begin{align*} I(J;\psi) \leq I(J;\mathbf{X}) &= \mathbb{E}_{J\sim\mathrm{Unif}(\lbrace 1,\cdots,m\rbrace)}\left[D_\mathrm{KL}(P_{\mathbf{X}|J}\Vert P_\mathbf{X})\right]\\ &= \frac{1}{m}\sum_{j=1}^m D_\mathrm{KL}\left(P_j^{\otimes n}\Vert\frac{1}{m}\sum_{k=1}^m P_k^{\otimes n}\right)\\ &\leq \frac{1}{m^2}\sum_{j=1}^m\sum_{k=1}^m D_\mathrm{KL}(P_j^{\otimes n}\Vert P_k^{\otimes n})\\ &= \frac{1}{m^2}\sum_{j=1}^m\sum_{k=1}^m nD_\mathrm{KL}(P_j\Vert P_k) \leq n\beta. \end{align*}\tag{1}$$

The first inequality follows from the data processing inequality for mutual information, and the second from Jensen's inequality together with the convexity of the KL divergence in its second argument; the final line uses the tensorization property $D_\mathrm{KL}(P^{\otimes n}\Vert Q^{\otimes n}) = nD_\mathrm{KL}(P\Vert Q)$ for product measures and the definition of $\beta.$

Moreover, we define a processed variable $Y = \mathbb{1}\lbrace J=\psi\rbrace.$ Denote by $p_e$ the error rate, i.e.

$$p_e = 1 - \mathbb{E}_{(J,\psi)\sim P_{J,\psi}}[Y] = \frac{1}{m}\sum_{j=1}^m P_j(\psi\neq j).$$

Applying the data processing inequality for the Kullback-Leibler divergence to the map $(j,k)\mapsto\mathbb{1}\lbrace j=k\rbrace$ (its image is $\mathrm{Bernoulli}(1-p_e)$ under $P_{J,\psi}$ and $\mathrm{Bernoulli}(1/m)$ under $P_JP_\psi,$ since $J$ is uniform), we have

$$\begin{align*} I(J;\psi) &= D_{\mathrm{KL}}(P_{J,\psi}\Vert P_JP_\psi)\\ &\geq D_{\mathrm{KL}}(\mathrm{Bernoulli}(1 - p_e)\Vert\mathrm{Bernoulli}(1/m))\\ &= (1-p_e)\log\frac{1-p_e}{1/m} + p_e\log\frac{p_e}{1-\frac{1}{m}}\\ &= p_e\log p_e + (1-p_e) \log (1-p_e) + \log m -p_e\log(m-1)\\ &\geq -\log 2 + (1-p_e)\log m \end{align*}\tag{2}$$

where the last inequality uses $p_e\log p_e + (1-p_e)\log(1-p_e)\geq -\log 2$ and $\log(m-1)\leq\log m.$ Combining $(1)$ and $(2)$ yields

$$p_e \geq 1 - \frac{I(J;\psi) + \log 2}{\log m} \geq 1 - \frac{n\beta + \log 2}{\log m},$$

which concludes the proof.
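
For the hypothetical Bernoulli family used in the simulation above, the right-hand side of Theorem 2 can be evaluated directly, as sketched below. When $n\beta$ is small relative to $\log m$ the bound certifies that every test must err with substantial average probability; it becomes vacuous once $n\beta+\log 2\geq\log m.$

```python
import numpy as np

def kl_bernoulli(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Same hypothetical Bernoulli family and sample size as in the simulation above.
thetas = np.array([0.40, 0.45, 0.50, 0.55])
m, n = len(thetas), 5

beta = max(kl_bernoulli(p, q) for p in thetas for q in thetas if p != q)
fano_bound = 1 - (n * beta + np.log(2)) / np.log(m)

# Every test psi satisfies (1/m) * sum_j P_j(psi != j) >= fano_bound.
print("beta =", beta, " Fano lower bound =", fano_bound)
```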

