Formulation

Definition. Let $P$ and $Q$ be two probability measures on a measurable space $(\mathcal{X},\Sigma),$ and let $f:\mathbb{R}^+\to\mathbb{R}$ be a convex function with $f(1)=0$ that is strictly convex at $1.$ Suppose that $Q$ is absolutely continuous with respect to $P.$ Then the $f$-divergence of $Q$ with respect to $P$ is defined as

$$D_f(Q\Vert P) := \int f\left(\frac{\mathrm{d}Q}{\mathrm{d}P}\right)\mathrm{d}P,$$

where the notation $\frac{\mathrm{d}Q}{\mathrm{d}P}$ stands for the Radon-Nikodym derivative of $Q$ with respect to $P.$

Remark. In practice, we often use the following two forms of $f$-divergence:

  • When $\mathcal{X}$ is discrete, $P$ and $Q$ are probability mass functions:

    $$D_f(Q\Vert P) = \sum_{x\in\mathcal{X}}f\left(\frac{Q(x)}{P(x)}\right)P(x).$$

  • When $P$ and $Q$ are specified by density functions $p$ and $q,$ respectively, then

    $$D_f(q\Vert p) = \int f\left(\frac{q(x)}{p(x)}\right)p(x)\mathrm{d}x.$$
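
As a small numerical illustration of the discrete form above, the following sketch computes $D_f(Q\Vert P)$ for finite probability mass functions; it assumes NumPy is available, and the helper name `f_divergence` and the example vectors are illustrative rather than part of the exposition.

```python
import numpy as np

def f_divergence(q, p, f):
    """D_f(Q || P) = sum_x f(q(x)/p(x)) * p(x) for finite PMFs with p > 0."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(f(q / p) * p))

# Generators of the three examples discussed later in this post.
f_kl = lambda x: x * np.log(x)          # Kullback-Leibler divergence
f_tv = lambda x: 0.5 * np.abs(x - 1)    # total variation distance
f_h2 = lambda x: (1 - np.sqrt(x))**2    # squared Hellinger distance

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(q, p, f_kl))   # D_KL(Q || P)
print(f_divergence(q, p, f_tv))   # d_TV(P, Q) = 0.5 * sum |q - p| = 0.1
print(f_divergence(q, p, f_h2))   # H^2(P, Q)
```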

Properties of $f$-divergences

  • (Non-negativity). $D_f(Q\Vert P)\geq 0,$ with equality if and only if $P=Q.$

    Proof. From Jensen’s inequality, the convexity of $f$ implies

    $$D_f(Q\Vert P) = \mathbb{E}_P\left[f\left(\frac{\mathrm{d}Q}{\mathrm{d}P}\right)\right] \geq f\left(\mathbb{E}_P\left[\frac{\mathrm{d}Q}{\mathrm{d}P}\right]\right) = f(1) = 0.$$

    Here $\mathbb{E}_P[\mathrm{d}Q/\mathrm{d}P] = Q(\mathcal{X}) = 1.$ Combined with the strict convexity of $f$ at $1,$ equality holds if and only if $P=Q.$

  • (Joint convexity). $(Q,P)\mapsto D_f(Q\Vert P)$ is a jointly convex function.

    Proof. Let $f:\mathbb{R}^+\to\mathbb{R}$ be convex on the open set $(0,+\infty).$ Then the perspective $(p,q)\mapsto pf(q/p)$ of $f,$ defined on $\mathbb{R}_{+}^2,$ is also convex. Since $D_f(Q\Vert P)$ is an integral of such perspective functions, it is jointly convex in $(Q,P)$; a numerical illustration is given after this list of properties.

  • (Conditional increment). Define the conditional $f$-divergence:

    $$D_f(Q_{Y|X}\Vert P_{Y|X}|P_X):= \mathbb{E}_{X\sim P_X}\left[D_f(Q_{Y|X}\Vert P_{Y|X})\right].$$

    Together with $P_X,$ the conditional distributions $P_{Y\vert X}$ and $Q_{Y\vert X}$ induce marginal distributions $P_Y = \mathbb{E}_{X\sim P_X}[P_{Y\vert X}]$ and $Q_Y = \mathbb{E}_{X\sim P_X}[Q_{Y\vert X}].$ Then

    $$D_f(Q_Y\Vert P_Y)\leq D_f(Q_{Y|X}\Vert P_{Y|X}|P_X).$$

    This follows from the joint convexity of $D_f$ together with Jensen’s inequality.
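
The joint convexity property can be checked numerically. The sketch below (NumPy assumed; all names illustrative) draws random strictly positive PMFs and verifies the convexity inequality for the KL generator $f(x)=x\log x;$ it is a sanity test, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(q, p):
    """D_f(Q || P) with f(x) = x log x, i.e. the KL divergence, for positive PMFs."""
    return float(np.sum(q * np.log(q / p)))

def random_pmf(n):
    w = rng.random(n) + 1e-3   # strictly positive masses
    return w / w.sum()

# Joint convexity: D_f(lam*Q1 + (1-lam)*Q2 || lam*P1 + (1-lam)*P2)
#                  <= lam*D_f(Q1 || P1) + (1-lam)*D_f(Q2 || P2).
for _ in range(1000):
    p1, p2, q1, q2 = [random_pmf(4) for _ in range(4)]
    lam = rng.random()
    lhs = kl(lam * q1 + (1 - lam) * q2, lam * p1 + (1 - lam) * p2)
    rhs = lam * kl(q1, p1) + (1 - lam) * kl(q2, p2)
    assert lhs <= rhs + 1e-12
print("joint convexity held on 1000 random instances")
```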

Examples

  • Total variation distance. $f(x)=\frac{1}{2}\vert x-1\vert:$

    $$d_{\mathrm{TV}}(P,Q) = \frac{1}{2}\int\left\vert\frac{\mathrm{d}Q}{\mathrm{d}P} - 1\right\vert\mathrm{d}P = \frac{1}{2}\int\vert\mathrm{d}Q - \mathrm{d}P\vert.$$

  • Kullback-Leibler divergence. $f(x) = x\log x:$

    $$D_{\mathrm{KL}}(Q\Vert P) = \int\log\left(\frac{\mathrm{d}Q}{\mathrm{d}P}\right)\mathrm{d}Q.$$

  • Squared Hellinger distance. $f(x) = (1-\sqrt{x})^2:$

    $$H^2(P,Q) = \int\left(1-\sqrt{\frac{\mathrm{d}Q}{\mathrm{d}P}}\right)^2\mathrm{d}P = \int\left(\sqrt{\mathrm{d}P} - \sqrt{\mathrm{d}Q}\right)^2.$$

Variational Representation

The variational form of $f$-divergence is based on the Fenchel conjugate of $f,$ which is defined as $f^*(t) = \sup_{x\in\mathbb{R}^+}\lbrace tx - f(x)\rbrace,\ t\in\mathbb{R}.$ The $f$-divergence admits the following variational formulation.

Lemma 1 (Variational representation of $f$-divergence). Let $\Phi$ be the class of all measurable functions $\phi:\mathcal{X}\to\mathbb{R}.$ Then

$$D_f(Q\Vert P) = \sup_{\phi\in\Phi} \left\lbrace\mathbb{E}_{X\sim Q}[\phi(X)] - \mathbb{E}_{X\sim P}[f^*(\phi(X))]\right\rbrace.$$

If $f$ is differentiable, then the supremum is attained at $\widetilde{\phi} = f'(\mathrm{d}Q/\mathrm{d}P).$

Proof. Fix a measurable function $\phi\in\Phi.$ By the Fenchel-Young inequality, we have

$$\phi(x)\frac{\mathrm{d}Q(x)}{\mathrm{d}P(x)} - f\left(\frac{\mathrm{d}Q(x)}{\mathrm{d}P(x)}\right)\leq f^*(\phi(x)).$$

Integrating both sides of the inequality above with respect to $P,$ we obtain

$$\mathbb{E}_{Z\sim Q}[\phi(Z)] - D_f(Q\Vert P) \leq \mathbb{E}_{X\sim P}[f^*(\phi(X))].$$

Since $\phi$ is arbitrary, taking the supremum over $\phi\in\Phi$ shows that the right-hand side of Lemma 1 is at most $D_f(Q\Vert P).$ Conversely, when $f$ is differentiable, the choice $\widetilde{\phi} = f'(\mathrm{d}Q/\mathrm{d}P)$ makes the Fenchel-Young inequality an equality, so the supremum is attained and equals $D_f(Q\Vert P).$
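
As a sanity check of Lemma 1 for $f(x)=x\log x$ (so $f'(x)=\log x+1$ and $f^*(t)=\mathrm{e}^{t-1}$), the sketch below evaluates the variational objective at the claimed maximizer $\widetilde{\phi}=\log(\mathrm{d}Q/\mathrm{d}P)+1$ and at random test functions on a finite space; the names and example vectors are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = float(np.sum(q * np.log(q / p)))          # D_KL(Q || P)

# Variational objective of Lemma 1 with f(x) = x log x, so f*(t) = exp(t - 1).
def objective(phi):
    return float(np.sum(q * phi) - np.sum(p * np.exp(phi - 1)))

phi_star = np.log(q / p) + 1                   # claimed maximizer f'(dQ/dP)
assert np.isclose(objective(phi_star), kl)     # the supremum is attained

for _ in range(1000):                          # random test functions stay below
    phi = 3 * rng.normal(size=3)
    assert objective(phi) <= kl + 1e-12
print("maximizer attains D_KL(Q||P) =", round(kl, 6))
```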

Examples

  • Total variation distance. $f(x)=\frac{1}{2}\vert x-1\vert,$ then

    $$f^*(t)=\sup_{x\geq 0}\left\lbrace tx - \frac{1}{2}\vert x-1\vert\right\rbrace = \begin{cases} -\frac{1}{2}, & t < -\frac{1}{2},\\ t, & -\frac{1}{2}\leq t\leq \frac{1}{2},\\ +\infty, & t > \frac{1}{2},\end{cases}$$

    and, since replacing $\phi$ by $\max\lbrace\phi,-\frac{1}{2}\rbrace$ can only increase the variational objective, we may restrict to $\Vert\phi\Vert_\infty\leq\frac{1}{2}$ and obtain

    $$d_{\mathrm{TV}}(P,Q) = \sup_{\phi:\Vert\phi\Vert_\infty \leq \frac{1}{2}} \left\lbrace\int\phi\,\mathrm{d}Q - \int\phi\,\mathrm{d}P\right\rbrace.$$

  • Kullback-Leibler divergence. $f(x)=x\log x,$ $f^*(t) = \mathrm{e}^{t-1},$

    $$D_{\mathrm{KL}}(Q\Vert P) = 1 + \sup_\phi \left\lbrace\int\phi(x)\,\mathrm{d}Q(x) - \int\exp\lbrace\phi(x)\rbrace\,\mathrm{d}P(x)\right\rbrace,$$

    where the shift by $1$ comes from substituting $\phi+1$ for $\phi$ in Lemma 1.
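
For the total variation example, the supremum above is attained at $\phi=\frac{1}{2}\,\mathrm{sign}(\mathrm{d}Q-\mathrm{d}P).$ The following sketch (illustrative names, finite PMFs) confirms this and checks that random feasible $\phi$ never exceed $d_{\mathrm{TV}}.$

```python
import numpy as np

rng = np.random.default_rng(2)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

tv = 0.5 * np.sum(np.abs(p - q))               # d_TV(P, Q) = 0.3

def objective(phi):                            # int phi dQ - int phi dP
    return float(np.sum(phi * (q - p)))

phi_star = 0.5 * np.sign(q - p)                # feasible: |phi| <= 1/2
assert np.isclose(objective(phi_star), tv)     # attains the supremum

for _ in range(1000):
    phi = rng.uniform(-0.5, 0.5, size=3)       # random feasible test functions
    assert objective(phi) <= tv + 1e-12
print("TV variational form attained at phi = 0.5 * sign(q - p):", tv)
```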

Data Processing Inequality

We consider a channel that generates a random variable $Y$ given $X$ based on a conditional distribution $P_{Y\vert X}.$

Theorem 1 (Data processing inequality). Fix the conditional distribution $P_ {Y\vert X}$ of $Y$ given $X.$ Denote by $P_Y$ the marginal distribution of $Y$ when $X$ is generated from distribution $P_X,$ and $Q_Y$ when $X$ is from $Q_X.$ Then for any $f$-divergence $D_f(\cdot\Vert\cdot),$

$$\begin{align*} D_f(Q_Y\Vert P_Y) \leq D_f(Q_X\Vert P_X). \end{align*}$$

Proof. Denote the joint distribution of $X$ and $Y$ by $P_{XY}$ when $X$ is generated from $P_X,$ and by $Q_{XY}$ when $X$ is generated from $Q_X.$ Since both joint distributions share the same conditional $P_{Y\vert X},$ the likelihood ratio does not depend on $Y,$ i.e. $\frac{\mathrm{d}Q_{XY}}{\mathrm{d}P_{XY}} = \frac{\mathrm{d}Q_{X}}{\mathrm{d}P_{X}},$ and hence

$$\begin{align*} D_f(Q_X\Vert P_X) = D_f(Q_{XY}\Vert P_{XY}). \end{align*}$$

Since $f$ is convex, we have from Jensen’s inequality that

$$\begin{align*} D_f(Q_{XY}\Vert P_{XY}) &= \mathbb{E}_{X,Y\sim P_{XY}}\left[f\left(\frac{\mathrm{d}Q_{XY}}{\mathrm{d}P_{XY}}\right)\right]\\ &= \mathbb{E}_{Y\sim P_Y}\left[\mathbb{E}_{X\sim P_{X\vert Y}}\left[f\left(\frac{\mathrm{d}Q_{XY}}{\mathrm{d}P_{XY}}\right)\right]\right]\\ &\geq \mathbb{E}_{Y\sim P_Y}\left[f\left(\mathbb{E}_{X\sim P_{X|Y}}\left[\frac{\mathrm{d}Q_{XY}}{\mathrm{d}P_{XY}}\right]\right)\right]\\ &= \mathbb{E}_{Y\sim P_Y}\left[f\left(\frac{\mathrm{d}Q_{Y}}{\mathrm{d}P_{Y}}\right)\right] = D_f(Q_Y\Vert P_Y). \end{align*}$$

Remark. An intuitive interpretation of this inequality is that processing $X$ makes it harder to distinguish between $P_X$ and $Q_X.$ If we take $P_{Y\vert X}$ to be a Dirac measure, then $Y=g(X)$ with probability $1$ for some deterministic map $g,$ so the inequality applies in particular to deterministic transformations of the data.
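
On a finite alphabet the data processing inequality can be checked directly: pushing $P_X$ and $Q_X$ through a row-stochastic channel matrix $K$ (row $x$ being $P_{Y\vert X=x}$) should never increase the divergence. A sketch with illustrative names, using the KL divergence as the $f$-divergence:

```python
import numpy as np

rng = np.random.default_rng(3)

def kl(q, p):
    return float(np.sum(q * np.log(q / p)))

def random_pmf(n):
    w = rng.random(n) + 1e-3
    return w / w.sum()

for _ in range(1000):
    p_x, q_x = random_pmf(5), random_pmf(5)
    K = np.stack([random_pmf(4) for _ in range(5)])   # row x is P_{Y|X=x}
    p_y, q_y = p_x @ K, q_x @ K                       # marginals of Y under P and Q
    assert kl(q_y, p_y) <= kl(q_x, p_x) + 1e-12       # data processing inequality
print("data processing inequality held on 1000 random channels")
```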

Total Variation

Taking $f(x)=\frac{1}{2}\vert x-1\vert,$ we obtain the total variation distance:

$$d_{\mathrm{TV}}(P,Q) = D_f(Q\Vert P) = \frac{1}{2}\int\left\vert\frac{\mathrm{d}Q}{\mathrm{d}P}-1\right\vert\mathrm{d}P = \frac{1}{2}\int\vert\mathrm{d}Q-\mathrm{d}P\vert.$$

Note that $d_{\mathrm{TV}}(\cdot,\cdot)$ is symmetric and bounded in $[0,1].$

Properties of total variation

Denote by $(\mathcal{X},\Sigma)$ the underlying measurable space, and let $P$ and $Q$ be two probability measures on it.

  • $d_{\mathrm{TV}}(P,Q) = \sup_{\mathcal{E}\in\Sigma} P(\mathcal{E}) - Q(\mathcal{E}),$ where the supremum is taken over all measurable sets.

    Proof. Since $P$ and $Q$ are two probability measures on $(\mathcal{X},\Sigma),$ $P-Q$ is a finite signed measure on $(\mathcal{X},\Sigma).$ By the Hahn decomposition theorem, there exist two measurable sets $\mathcal{P}$ and $\mathcal{N}$ such that

    1. $\mathcal{P}\cup\mathcal{N}=\mathcal{X}$ and $\mathcal{P}\cap\mathcal{N}=\emptyset,$
    2. $P(\mathcal{E}) - Q(\mathcal{E}) \geq 0$ for any $\mathcal{E}\in\Sigma$ with $\mathcal{E}\subset\mathcal{P},$ and
    3. $P(\mathcal{E}) - Q(\mathcal{E}) \leq 0$ for any $\mathcal{E}\in\Sigma$ with $\mathcal{E}\subset\mathcal{N}.$

    For any $\mathcal{E}\in\Sigma,$ since $\mathcal{P}$ and $\mathcal{N}$ form a partition of $\mathcal{X},$ we have

    $$\begin{align*} P(\mathcal{E}) - Q(\mathcal{E}) &= \int_{\mathcal{E}} \mathrm{d}[P-Q]\\ &= \int_{\mathcal{E}\cap\mathcal{P}} \mathrm{d}[P-Q] + \int_{\mathcal{E}\cap\mathcal{N}} \mathrm{d}[P-Q]\\ &\leq \int_{\mathcal{E}\cap\mathcal{P}} \mathrm{d}[P-Q]\\ &\leq \int_{\mathcal{P}} \mathrm{d}[P-Q]. \end{align*}$$

    Moreover, we have for the total variation that

    $$\begin{align*} d_{\mathrm{TV}}(P,Q) &= \frac{1}{2}\int_{\mathcal{P}}\mathrm{d}[P-Q] - \frac{1}{2}\int_{\mathcal{N}}\mathrm{d}[P-Q]\\ &= \int_{\mathcal{P}}\mathrm{d}[P-Q], \end{align*}$$

    where the last equality uses $\int_{\mathcal{P}}\mathrm{d}[P-Q] + \int_{\mathcal{N}}\mathrm{d}[P-Q] = (P-Q)(\mathcal{X}) = 0.$

    Hence $d_{\mathrm{TV}}(P,Q) = \sup_{\mathcal{E}\in\Sigma} P(\mathcal{E}) - Q(\mathcal{E}),$ where the supremum is achieved when $\mathcal{E}=\mathcal{P}.$

  • Let $\mathcal{F}=\lbrace f:\mathcal{X}\to\mathbb{R},\Vert f\Vert_\infty \leq 1\rbrace.$ Then

    $$d_{\mathrm{TV}}(P,Q) = \frac{1}{2}\sup_{f\in\mathcal{F}} \left\lbrace\int f\,\mathrm{d}P-\int f\,\mathrm{d}Q\right\rbrace.$$

    This can be viewed as a direct corollary of Hölder’s inequality, or as a variant of the variational representation above (rescaling $\phi$ by a factor of $2$). Both characterizations are verified on a small finite space below.
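
Both characterizations above can be brute-forced on a small finite space: enumerate all events for the first, and use $\mathrm{sign}(\mathrm{d}P-\mathrm{d}Q)$ for the second. A sketch with illustrative names, assuming NumPy:

```python
import numpy as np
from itertools import chain, combinations

p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.3, 0.4])

tv = 0.5 * np.sum(np.abs(p - q))              # d_TV(P, Q) = 0.4

# First characterization: maximize P(E) - Q(E) over all 2^4 events E.
events = chain.from_iterable(combinations(range(4), r) for r in range(5))
best_event = max(p[list(e)].sum() - q[list(e)].sum() for e in events)
assert np.isclose(best_event, tv)

# Second characterization: the sup over ||f||_inf <= 1 is attained at sign(dP - dQ).
best_f = np.sign(p - q)
assert np.isclose(0.5 * np.sum(best_f * (p - q)), tv)

print("sup_E P(E) - Q(E) =", best_event, "=", tv)
```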

Inequalities between $f$-divergences

Pinsker’s Inequality

Pinsker’s inequality gives an upper bound on the total variation distance in terms of the Kullback-Leibler divergence.

Theorem 2 (Pinsker’s inequality). If $P$ and $Q$ are two probability measures on a measurable space $(\mathcal{X},\Sigma)$ with $Q\ll P,$ then

$$d_{\mathrm{TV}}(P,Q) \leq \sqrt{\frac{1}{2}D_{\mathrm{KL}}(Q\Vert P)}.$$

Proof. We first show the case of two Bernoulli distributions. Let $P=\mathrm{Bernoulli}(p)$ and $Q=\mathrm{Bernoulli}(q)$ with $p,q\in (0,1),$ then

$$\begin{align*} D_{\mathrm{KL}}(Q\Vert P) &= q\log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p} \\ &= q\int_p^q \frac{\mathrm{d}t}{t} - (1-q)\int_p^q\frac{\mathrm{d}t}{1-t}\\ &= \int_p^q\frac{q-t}{t(1-t)}\mathrm{d}t\\ &\geq 4\int_p^q (q-t)\mathrm{d}t = 2\vert p-q\vert^2 = 2d^2_{\mathrm{TV}}(P,Q),\tag{*} \end{align*}$$

where the inequality uses $t(1-t)\leq\frac{1}{4}$ for $t\in[0,1].$

Since $Q\ll P,$ then $p=0$ implies $q=0,$ and $p=1$ implies $q=1.$ We can extend $(*)$ to the case $p,q\in [0,1]$ by defining $0\log 0 = 0.$

Now we consider the general case. For an arbitrary set $\mathcal{E}\in\Sigma,$ define $Y=\mathbb{1}\lbrace X\in\mathcal{E}\rbrace.$ Then $Y\sim\mathrm{Bernoulli}(P(\mathcal{E}))$ when $X\sim P$ and $Y\sim\mathrm{Bernoulli}(Q(\mathcal{E}))$ when $X\sim Q,$ and we have

$$\begin{align*}\sqrt{\frac{1}{2}D_\mathrm{KL}(Q\Vert P)} &\geq \sqrt{\frac{1}{2}D_\mathrm{KL}\left(\mathrm{Bernoulli}(Q(\mathcal{E}))\Vert\mathrm{Bernoulli}(P(\mathcal{E}))\right)}\\ &\geq d_{\mathrm{TV}}(\mathrm{Bernoulli}(P(\mathcal{E})),\mathrm{Bernoulli}(Q(\mathcal{E}))) = \vert P(\mathcal{E})-Q(\mathcal{E})\vert,\end{align*}$$

where the first inequality follows from the data processing inequality, and the second from $(*).$

Taking the supremum over $\mathcal{E}\in\Sigma$ then gives

$$\sqrt{\frac{1}{2}D_\mathrm{KL}(Q\Vert P)} \geq \sup_{\mathcal{E}\in\Sigma} \vert P(\mathcal{E})-Q(\mathcal{E})\vert = d_{\mathrm{TV}}(P,Q),$$

which concludes the proof.
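
A quick numerical sanity check of Pinsker’s inequality on random pairs of strictly positive PMFs (so that $Q\ll P$ holds automatically); the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def random_pmf(n):
    w = rng.random(n) + 1e-3   # strictly positive, so Q << P holds
    return w / w.sum()

for _ in range(1000):
    p, q = random_pmf(6), random_pmf(6)
    tv = 0.5 * np.sum(np.abs(p - q))
    kl = np.sum(q * np.log(q / p))
    assert tv <= np.sqrt(0.5 * kl) + 1e-12    # Pinsker's inequality
print("Pinsker's inequality held on 1000 random pairs")
```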

An alternative bound, which is sharper than Pinsker’s inequality when the Kullback-Leibler divergence is large, is presented below.

Theorem 3 (The Bretagnolle-Huber bound). If $P$ and $Q$ are two probability measures on a measurable space $(\mathcal{X},\Sigma)$ with $Q\ll P,$ then

$$d_{\mathrm{TV}}(P,Q) \leq \sqrt{1 - \exp\left(-D_{\mathrm{KL}}(Q\Vert P)\right)} \leq 1 - \frac{1}{2} \exp\left(-D_{\mathrm{KL}}(Q\Vert P)\right).$$

Proof. As in the previous proof, the data processing inequality reduces the problem to the Bernoulli case:

$$\begin{align*} D_{\mathrm{KL}}(Q\Vert P) &= q\log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}\\ &= -2q\log\sqrt{\frac{p}{q}} - 2(1-q)\log\sqrt{\frac{1-p}{1-q}}\\ &\geq -2\log\left(\sqrt{pq} + \sqrt{(1-p)(1-q)}\right), \end{align*}$$

where the inequality follows from the convexity of function $x\mapsto -\log x.$ Then, we have

$$\begin{align*} \exp\left\lbrace -D_{\mathrm{KL}}(Q\Vert P)\right\rbrace &\leq \left(\sqrt{pq} + \sqrt{(1-p)(1-q)}\right)^2\\ &\leq \left(\sqrt{pq} + \sqrt{(1-p)(1-q)}\right)^2 + \left\vert\sqrt{p(1-p)} - \sqrt{q(1-q)}\right\vert^2\\ &= 1 - \vert p-q\vert^2 = 1-d_{\mathrm{TV}}^2(P,Q).\end{align*}$$

Rearranging gives $d_{\mathrm{TV}}(P,Q)\leq\sqrt{1-\exp\left(-D_{\mathrm{KL}}(Q\Vert P)\right)},$ and the second inequality in the theorem follows from $\sqrt{1-u}\leq 1-\frac{u}{2}$ for $u\in[0,1].$
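
The two bounds are complementary: Pinsker’s bound is tighter when the KL divergence is small, while the Bretagnolle-Huber bound remains informative (below $1$) when it is large. A small comparison sketch with illustrative names:

```python
import numpy as np

def pinsker_bound(kl):
    return np.sqrt(0.5 * kl)

def bretagnolle_huber_bound(kl):
    return np.sqrt(1.0 - np.exp(-kl))

for kl in [0.05, 0.5, 2.0, 10.0]:
    print(f"KL = {kl:5.2f}:  Pinsker = {pinsker_bound(kl):.3f},  "
          f"Bretagnolle-Huber = {bretagnolle_huber_bound(kl):.3f}")
# For small KL the Pinsker bound is tighter; for large KL it exceeds 1 and
# becomes vacuous, while the Bretagnolle-Huber bound stays below 1.
```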

