Proper Scoring Rules
We concentrate on probabilistic forecasts on a non-empty set $\Omega.$ Let $\mathcal{A}$ be a $\sigma$-algebra of subsets of $\Omega,$ and let $\mathcal{P}$ be a convex class of probability measures on $(\Omega,\mathcal{A}).$ A function $f:\Omega\to\overline{\mathbb{R}}$ is said to be $\mathcal{P}$-quasi-integrable if it is $\mathcal{A}$-measurable and quasi-integrable with respect to all $P\in\mathcal{P}.$ Any $P\in\mathcal{P}$ is called a probabilistic forecast.
A scoring rule is any extended real-valued function $S:\mathcal{P}\times\Omega\to\overline{\mathbb{R}}$ such that $S(P,\cdot)$ is $\mathcal{P}$-quasi-integrable for all $P\in\mathcal{P}.$ That is, a forecaster who issues the probabilistic forecast $P$ receives the reward $S(P,\omega)$ when the event $\omega\in\Omega$ materializes.
Convexity and Propriety
Propriety. Under a probability measure $Q,$ the expected score of forecast $P$ can be written as
$$S(P,Q)=\int S(P,\omega)\mathrm{d}Q(\omega).$$
Intuitively, we prefer a scoring rule under which the expected score is maximized when the forecast $P$ agrees with the true measure $Q.$ Accordingly, the scoring rule is said to be proper relative to $\mathcal{P}$ if
$$S(P,Q)\leq S(Q,Q)\ \ \forall P,Q\in\mathcal{P}.$$
Moreover, $S$ is strictly proper if equality holds only when $P=Q.$
Convexity. A function $G:\mathcal{P}\to\mathbb{R}$ is convex if
$$G((1-\lambda)P_0 + \lambda P_1)\leq (1-\lambda)G(P_0) + \lambda G(P_1)\ \ \forall \lambda\in(0,1),\ P_0,P_1\in\mathcal{P},$$
and $G$ is strictly convex if equality holds only when $P_ 0=P_ 1.$
Analogous to the concept of subgradient, a function $G^*(P,\cdot):\Omega\to\mathbb{R}$ is a subtangent of $G$ at point $P\in\mathcal{P}$ if it is integrable with respect to $P$ and quasi-integrable with respect to all $Q\in\mathcal{P},$ and satisfies
$$G(Q)\geq G(P) + \int G^*(P,\omega)\mathrm{d}[Q-P](\omega)$$
for all $Q\in\mathcal{P}.$
Characterization of proper scoring rules. The connection between convexity and propriety is established as follows.
Definition 1 (Regularity). A scoring rule $S:\mathcal{P}\times\Omega\to\overline{\mathbb{R}}$ is regular relative to $\mathcal{P}$ if $S(P,Q)$ is real-valued for all $P,Q\in\mathcal{P},$ except possibly that $S(P,Q)=-\infty$ when $P\neq Q.$
Theorem 1. A regular scoring rule $S:\mathcal{P}\times\Omega\to\overline{\mathbb{R}}$ is proper relative to $\mathcal{P}$ if and only if there exists a convex, real-valued function $G:\mathcal{P}\to\mathbb{R}$ such that
$$S(P,\omega) = G(P) - \int G^*(P,\tilde{\omega})\mathrm{d}P(\tilde{\omega}) + G^*(P,\omega)$$
for $P\in\mathcal{P}$ and $\omega\in\Omega,$ where $G^*(P,\cdot):\Omega\to\overline{\mathbb{R}}$ is a subtangent of $G$ at the point $P\in\mathcal{P}.$
Proof. Sufficiency: with the stated form of $S,$ integrating against $Q$ gives $S(P,Q) = G(P) + \int G^*(P,\omega)\mathrm{d}[Q-P](\omega) \leq G(Q) = S(Q,Q)$ by the definition of subtangent, so $S$ is proper.
Necessity: Suppose that $S$ is a regular proper scoring rule. Define $G:\mathcal{P}\to\mathbb{R}$ by $G(P) = S(P,P) = \sup_ {Q\in\mathcal{P}} S(Q,P),$ where the second equality is propriety. For fixed $Q,$ the map $P\mapsto S(Q,P)$ is linear in $P,$ hence convex, so $G,$ as a pointwise supremum of convex functions, is convex. One then verifies that $S(P,\cdot):\Omega\to\overline{\mathbb{R}}$ is a subtangent of $G$ at the point $P,$ and $S$ has the stated representation with $G^*(P,\omega) = S(P,\omega)$ plugged in.
Remark. An alternative statement of Theorem 1 is that a regular scoring rule $S$ is proper relative to the class $\mathcal{P}$ if and only if the expected score function $G(P) = S(P,P)$ is convex and $S(P,\omega)$ is a subtangent of $G$ at point $P$ for all $P\in\mathcal{P}.$ Moreover, strict propriety and strict convexity imply each other.
Information Measures, Bregman Divergences and Decision Theory
Suppose that the scoring rule $S$ is proper relative to the class $\mathcal{P}.$ Then the expected score function
$$G(P) = \sup_{Q\in\mathcal{P}} S(Q,P),\ \ P\in\mathcal{P},$$
is referred to as the information measure or generalized entropy function associated with $S.$
If the scoring rule $S$ is regular and proper, the divergence function associated with $S$ is defined as
$$d(P,Q) = S(Q,Q) - S(P,Q),\ \ P,Q\in\mathcal{P}.$$
The divergence is nonnegative, and if $S$ is strictly proper, then $d(P,Q)$ is strictly positive for $P\neq Q.$ If the sample space is finite and the entropy function is sufficiently smooth, then the divergence function becomes the Bregman divergence (Bregman 1967) associated with the convex function $G:$
$$d(P,Q) = G(Q) - G(P) - \int G^*(P,\omega)\mathrm{d}[Q-P](\omega).$$
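On a finite sample space the divergence above can be evaluated directly from a differentiable convex $G$ and its gradient. The following minimal Python sketch (an illustration of mine, not from the source; all names are hypothetical) computes the Bregman divergence; with $G(\mathbf{p}) = \sum_i p_i^2$ it returns the squared Euclidean distance $\Vert\mathbf{p}-\mathbf{q}\Vert_2^2.$

```python
import numpy as np

def bregman_divergence(G, grad_G, p, q):
    """Bregman divergence d(p, q) = G(q) - G(p) - <grad G(p), q - p>
    on a finite sample space, where probability measures are vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return G(q) - G(p) - np.dot(grad_G(p), q - p)

# Example: the convex function G(p) = sum_i p_i^2, whose Bregman
# divergence is the squared Euclidean distance ||p - q||^2.
G = lambda p: np.sum(p ** 2)
grad_G = lambda p: 2 * p

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.1, 0.6, 0.3])
print(bregman_divergence(G, grad_G, p, q))   # equals np.sum((p - q) ** 2)
```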
Proper scoring rules arise naturally in statistical decision problems. Given an outcome space and an action space, let $U(\omega,a)$ be the utility of outcome $\omega$ under action $a,$ and let $\mathcal{P}$ be a convex class of probability measures on the outcome space. Let $a_P$ denote a Bayes act for $P\in\mathcal{P},$ that is, an action that maximizes the expected utility $\int U(\omega,a)\mathrm{d}P(\omega).$ Then the scoring rule
$$S(P,\omega)=U(\omega,a_P)$$
is proper relative to class $\mathcal{P}.$ Indeed,
$$S(Q,Q) = \int U(\omega,a_Q)\mathrm{d}Q(\omega) \geq \int U(\omega,a_P)\mathrm{d}Q(\omega) = S(P,Q)$$
because the Bayes act $a_Q$ maximizes the expected utility under $Q.$
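For a concrete illustration (not taken from the source), let the outcome space be $\mathbb{R},$ restrict $\mathcal{P}$ to measures with finite second moment, and take the quadratic utility $U(\omega,a)=-(\omega-a)^2.$ The Bayes act under $P$ is the mean $a_P=\mu_P,$ and the induced scoring rule is
$$S(P,\omega) = U(\omega,a_P) = -(\omega-\mu_P)^2,$$
the negative squared error. Its expected value under $Q$ is $-\mathrm{Var}_Q(\omega)-(\mu_Q-\mu_P)^2,$ which is maximized by any $P$ with $\mu_P=\mu_Q,$ so this rule is proper but not strictly proper.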
Scoring Rules for Categorical Variables
Savage representation
In this section we investigate probabilistic forecasts of categorical variables, where the sample space $\Omega=\lbrace 1,\cdots,m\rbrace$ consists of a finite number $m$ of mutually exclusive events, and a probabilistic forecast is a probability vector $\mathbf{p}=(p_1,\cdots,p_m).$ The convex class $\mathcal{P}$ then becomes the probability simplex
$$\mathcal{P}_m = \{\mathbf{p}=(p_1,\cdots,p_m):p_1,\cdots,p_m\geq 0, p_1+\cdots+p_m=1\}.$$
A scoring rule $S$ can be identified with a collection of $m$ functions:
$$S(\cdot,i):\mathcal{P}_m\to\overline{\mathbb{R}},\ \ i=1,\cdots,m.$$
Let $G:\mathcal{P}_m\to\mathbb{R}$ be a convex function. If $G$ is differentiable at an interior point $\mathbf{p}$ of $\mathcal{P}_ m,$ then its gradient
$$\nabla G(\mathbf{p}) = \left(\frac{\partial}{\partial p_1}G(\mathbf{p}),\cdots,\frac{\partial}{\partial p_m}G(\mathbf{p})\right)^\top$$
is a subgradient of $G$ at $\mathbf{p}.$ In general, denote a subgradient of $G$ at $\mathbf{p}\in\mathcal{P}_ m$ by $\mathbf{G}^\prime(\mathbf{p}) = (G_ 1^\prime(\mathbf{p}),\cdots,G_ m^\prime(\mathbf{p}))^\top,$ which satisfies
$$G(\mathbf{q})\geq G(\mathbf{p}) + \langle \mathbf{G}^\prime(\mathbf{p}),\mathbf{q}-\mathbf{p}\rangle\ \ \forall\mathbf{q}\in\mathcal{P}_m.$$
The following statement is a special case of Theorem 1.
Definition 2 (Regularity). A scoring rule $S$ for categorical forecasts is regular if $S(\cdot,i)$ is real-valued for $i=1,\cdots,m,$ except possibly that $S(\mathbf{p},i)=-\infty$ if $p_i=0.$
Theorem 2 (McCarthy, Savage). A regular scoring rule $S$ for categorical forecasts is (strictly) proper if and only if
$$S(\mathbf{p},i) = G(\mathbf{p}) - \langle\mathbf{G}^\prime(\mathbf{p}),\mathbf{p}\rangle + G_i^\prime(\mathbf{p})\ \ \text{for}\ i=1,\cdots,m,$$
where $G:\mathcal{P}_ m\to\mathbb{R}$ is a (strictly) convex function and $\mathbf{G}^\prime(\mathbf{p})$ is a subgradient of $G$ at point $\mathbf{p}$ for all $\mathbf{p}\in\mathcal{P}_m.$
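The Savage representation translates directly into code. Below is a minimal Python sketch (names and the propriety check are mine, not from the source) that builds a score from a differentiable convex $G$ on the simplex; with $G(\mathbf{p})=\sum_j p_j^2$ it yields the quadratic (Brier) score $S(\mathbf{p},i)=2p_i-\sum_j p_j^2.$

```python
import numpy as np

def savage_score(G, grad_G, p, i):
    """Savage representation: S(p, i) = G(p) - <grad G(p), p> + dG/dp_i."""
    g = grad_G(p)
    return G(p) - np.dot(g, p) + g[i]

# Convex G(p) = sum_j p_j^2 gives the quadratic (Brier) score 2 p_i - sum_j p_j^2.
G = lambda p: np.sum(p ** 2)
grad_G = lambda p: 2 * p

rng = np.random.default_rng(0)
q = rng.dirichlet(np.ones(4))      # "true" distribution
p = rng.dirichlet(np.ones(4))      # forecast
expected_score = lambda f: sum(q[i] * savage_score(G, grad_G, f, i) for i in range(4))
assert expected_score(q) >= expected_score(p)   # propriety: truth-telling is optimal
```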
Logarithmic score, Negative Shannon entropy and Kullback-Leibler divergence
Taking the logarithmic score $S(\mathbf{p},i) = \log p_i$ yields the expected score function $G(\mathbf{p}) = \sum_ {i=1}^m p_i\log p_i,$ i.e., the negative Shannon entropy. Correspondingly, the associated Bregman divergence
$$d(\mathbf{p},\mathbf{q}) = S(\mathbf{q},\mathbf{q}) - S(\mathbf{p},\mathbf{q}) = \sum_{i=1}^m q_i\log\frac{q_i}{p_i}$$
becomes the Kullback-Leibler divergence. In this case $S$ is strictly proper and $G$ is strictly convex.
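To see how this fits the Savage representation, note that $G_ j^\prime(\mathbf{p}) = \log p_j + 1,$ so that
$$S(\mathbf{p},i) = G(\mathbf{p}) - \sum_{j=1}^m p_j(\log p_j + 1) + (\log p_i + 1) = \log p_i,$$
recovering the logarithmic score from the convex function $G.$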
Schervish representation
In this section we investigate probabilistic forecasts of dichotomous variables, where the sample space is $\Omega=\lbrace 0,1\rbrace$ and a probabilistic forecast is a quoted probability $p\in[0,1]$ for the outcome $1.$ A scoring rule $S$ is regular if $S(\cdot,1)$ and $S(\cdot,0)$ are real-valued, except possibly that $S(0,1)=-\infty$ and $S(1,0)=-\infty.$ A variant of Theorem 2 shows that every regular proper scoring rule is of the form:
$$\begin{align*} S(p,1) &= G(p) + (1-p)G^\prime(p),\\ S(p,0) &= G(p) - pG^\prime(p), \end{align*}$$
where $G$ is a convex function and $G^\prime(p)$ is a subgradient of $G$ at any point $p\in[0,1]$ in the sense that
$$G(q)\geq G(p) + (q-p)G^\prime(p)\ \ \forall q\in[0,1].$$
Theorem 3 (Schervish). Suppose that $S$ is a regular scoring rule. Then $S$ is proper and such that $S(0, 1) = \lim_{p\to 0} S(p, 1),$ and $S(0, 0) = \lim_{p\to 0} S(p, 0),$ and both $S(p, 1)$ and $S(p, 0)$ are left continuous if and only if there exists a nonnegative measure $\nu$ on $(0, 1)$ such that
$$\begin{align*} S(p,1) &= S(1,1) - \int (1-c)\mathbb{1}\lbrace p\leq c\rbrace\nu(\mathrm{d}c),\\ S(p,0) &= S(0,0) - \int c\mathbb{1}\lbrace p> c\rbrace\nu(\mathrm{d}c), \end{align*}$$
for all $p\in[0,1].$ The scoring rule is strictly proper if and only if $\nu$ assigns positive measure to every open interval.
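As a worked example (a standard instance, spelled out here for concreteness), take $\nu(\mathrm{d}c) = 2\,\mathrm{d}c,$ twice the Lebesgue measure on $(0,1),$ and normalize $S(1,1)=S(0,0)=0.$ Then
$$\begin{align*} S(p,1) &= -2\int_p^1 (1-c)\,\mathrm{d}c = -(1-p)^2,\\ S(p,0) &= -2\int_0^p c\,\mathrm{d}c = -p^2, \end{align*}$$
which is the quadratic (Brier) score in positive orientation; since $\nu$ assigns positive measure to every open interval, it is strictly proper.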
Scoring Rules for Continuous Variables
Density forecasts
Let $\mu$ be a $\sigma$-finite measure on $(\Omega,\mathcal{A}),$ for example the Lebesgue measure. For $\alpha>1,$ denote by $\mathcal{L}_ \alpha$ the class of probability measures on $(\Omega,\mathcal{A})$ that are absolutely continuous with respect to $\mu$ and have a $\mu$-density $p$ for which
$$\Vert p\Vert_\alpha = \left(\int p(\omega)^\alpha\mu(\mathrm{d}\omega)\right)^{1/\alpha}$$
is finite.
Then we identify a probabilistic forecast $P\in\mathcal{L}_ \alpha$ with its $\mu$-density $p$ in the sense that
$$p(\omega) = \lim_{\rho\to 0}\frac{P(B_\rho(\omega))}{\mu(B_\rho(\omega))},\ \ \omega\in\Omega,$$
where $B_\rho(\omega)$ denotes the ball of radius $\rho$ centered at $\omega.$
Quadratic score. Let $S(p,\omega) = 2p(\omega) - \Vert p\Vert_ 2^2.$ Then the generalized entropy is $G(p)=\Vert p\Vert_2^2,$ and the associated divergence is $d(p,q)=\Vert p - q\Vert_2^2.$
Logarithmic score. Let $S(p,\omega) = \log p(\omega).$ Then the generalized entropy is $G(p) = \int p(\omega)\log p(\omega)\mu(\mathrm{d}\omega),$ the negative Shannon entropy, and the associated divergence is the Kullback-Leibler divergence $d(p,q) = D_{\mathrm{KL}}(q\Vert p) = \int q(\omega)\log\frac{q(\omega)}{p(\omega)}\mu(\mathrm{d}\omega).$
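As a quick numerical illustration (a sketch with densities of my own choosing, not from the source), the divergence of the logarithmic score between two Gaussian forecast densities can be computed by quadrature and compared with the closed-form Gaussian Kullback-Leibler divergence.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Divergence of the logarithmic score: d(p, q) = KL(q || p) = int q log(q / p) dmu.
p = norm(loc=0.0, scale=1.0).pdf     # forecast density
q = norm(loc=0.5, scale=1.5).pdf     # "true" density

kl_numeric, _ = quad(lambda w: q(w) * np.log(q(w) / p(w)), -20, 20)

# Closed form for Gaussians: log(s0/s1) + (s1^2 + (m1 - m0)^2) / (2 s0^2) - 1/2.
m0, s0, m1, s1 = 0.0, 1.0, 0.5, 1.5
kl_exact = np.log(s0 / s1) + (s1 ** 2 + (m1 - m0) ** 2) / (2 * s0 ** 2) - 0.5
print(kl_numeric, kl_exact)          # agree up to quadrature and truncation error
```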
Continuous ranked probability score
Denote by $\mathcal{P}$ the class of Borel probability measures on $\mathbb{R}.$ We identify a probabilistic forecast from $\mathcal{P}$ with its cumulative distribution function $F.$ Then the continuous ranked probability score (CRPS) is defined as
$$\mathrm{CRPS}(F,x) = -\int_{-\infty}^{\infty}\left(F(y)-\mathbb{1}\lbrace y\geq x\rbrace\right)^2\mathrm{d}y,$$
If $F$ has a finite first moment, the CRPS can be written as minus one half of the energy distance between $F$ and the point mass $\delta_x$:
$$\mathrm{CRPS}(F,x) = \frac{1}{2}\mathbb{E}\vert X-X'\vert - \mathbb{E}\vert X-x\vert,$$
where $X,X^\prime\overset{\mathrm{i.i.d.}}{\sim} F.$
The associated information measure (generalized entropy) is
$$G(F) = -\int F(y)(1-F(y))\mathrm{d}y = -\frac{1}{2}\mathbb{E}\vert X-X'\vert,$$
and the associated divergence is
$$d(F,G) = \int (F(y)-G(y))^2\mathrm{d}y.$$
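A sample-based estimate of the CRPS follows directly from the kernel form above. The sketch below (my own illustration; the $O(n^2)$ pairwise estimator is chosen for clarity, not efficiency) estimates $\mathrm{CRPS}(F,x)$ from independent draws of the forecast $F.$

```python
import numpy as np

def crps_sample(sample, x):
    """Estimate CRPS(F, x) = 0.5 E|X - X'| - E|X - x| from draws X_1, ..., X_n of F."""
    sample = np.asarray(sample, float)
    n = len(sample)
    # U-statistic estimate of E|X - X'| over ordered pairs of distinct draws.
    e_xx = np.abs(sample[:, None] - sample[None, :]).sum() / (n * (n - 1))
    e_xo = np.abs(sample - x).mean()
    return 0.5 * e_xx - e_xo

rng = np.random.default_rng(1)
draws = rng.normal(loc=0.0, scale=1.0, size=2000)   # forecast F = N(0, 1)
print(crps_sample(draws, x=0.3))   # the CRPS is nonpositive in this reward orientation
```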
Energy score
Let $\mathcal{P}_ \beta,$ $\beta\in(0,2),$ denote the class of Borel probability measures $P$ on $\mathbb{R}^m$ such that $\mathbb{E}\Vert\mathbf{X}\Vert_2^\beta$ is finite for $\mathbf{X}\sim P.$ The energy score is defined as a generalization of the CRPS:
$$\mathrm{ES}(P,\mathbf{x}) = \frac{1}{2}\mathbb{E}\Vert\mathbf{X}-\mathbf{X}^\prime\Vert_2^\beta - \mathbb{E}\Vert\mathbf{X}-\mathbf{x}\Vert_2^\beta,$$
where $\mathbf{X},\mathbf{X}^\prime\overset{\mathrm{i.i.d.}}{\sim} P.$ It reduces to the CRPS when $m=1$ and $\beta=1.$
When $\beta=2,$ the energy score reduces to the negative squared error:
$$\begin{align*} &\frac{1}{2}\mathbb{E}\Vert \mathbf{X}-\mathbf{X}^\prime\Vert_2^2 - \mathbb{E}\Vert\mathbf{X}-\mathbf{x}\Vert_2^2\\ &= \left(\mathbb{E}\Vert\mathbf{X}\Vert_2^2 - \mathbb{E}\langle\mathbf{X},\mathbf{X}^\prime\rangle\right) - \left(\mathbb{E}\Vert\mathbf{X}\Vert_2^2 - 2\mathbb{E}\langle\mathbf{X},\mathbf{x}\rangle + \Vert\mathbf{x}\Vert_2^2\right)\\ &= -\Vert\boldsymbol{\mu}_P\Vert_2^2 + 2\langle\boldsymbol{\mu}_P,\mathbf{x}\rangle - \Vert\mathbf{x}\Vert_2^2\\ &= -\Vert \boldsymbol{\mu}_P - \mathbf{x}\Vert_2^2, \end{align*}$$
where $\boldsymbol{\mu}_P$ is the mean vector of $P.$
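The expectations in the energy score are straightforward to estimate from draws of $P.$ The following sketch (my own illustration with an arbitrary Gaussian forecast) estimates $\mathrm{ES}(P,\mathbf{x})$ by Monte Carlo; for $\beta=2$ the estimate should be close to $-\Vert\boldsymbol{\mu}_P-\mathbf{x}\Vert_2^2,$ consistent with the derivation above.

```python
import numpy as np

def energy_score_sample(sample, x, beta=1.0):
    """Estimate ES(P, x) = 0.5 E||X - X'||^beta - E||X - x||^beta
    from independent draws X_1, ..., X_n (the rows of `sample`) of P."""
    sample = np.asarray(sample, float)
    n = len(sample)
    pairwise = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=-1) ** beta
    e_xx = pairwise.sum() / (n * (n - 1))
    e_xo = (np.linalg.norm(sample - x, axis=-1) ** beta).mean()
    return 0.5 * e_xx - e_xo

rng = np.random.default_rng(2)
draws = rng.multivariate_normal(mean=[1.0, -1.0], cov=np.eye(2), size=1000)
x = np.array([0.0, 0.0])
print(energy_score_sample(draws, x, beta=1.0))   # energy score with beta = 1
print(energy_score_sample(draws, x, beta=2.0))   # approximately -||mu_P - x||^2 = -2
```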
The energy score with index $\beta\in(0,2)$ also admits the following representation:
$$\mathrm{ES}(P,\mathbf{x}) = -\frac{\beta 2^{\beta-2}\Gamma\left(\frac{m+\beta}{2}\right)}{\pi^{m/2}\Gamma\left(1-\frac{\beta}{2}\right)}\int_{\mathbb{R}^m}\frac{\left\vert\phi_P(\mathbf{y}) - \exp(\mathrm{i}\langle\mathbf{x},\mathbf{y}\rangle)\right\vert^2}{\Vert\mathbf{y}\Vert^{m+\beta}_2}\mathrm{d}\mathbf{y},$$
where $\phi_P$ denotes the characteristic function of $P.$
References
- Tilmann Gneiting & Adrian E Raftery (2007) Strictly Proper Scoring Rules, Prediction, and Estimation, Journal of the American Statistical Association, 102:477, 359-378.