Spatial Statistics (IV): Kriging
Kriging is an interpolation method based on Gaussian processes and is also known as Gaussian process regression (GPR). Georges Matheron established the theoretical basis of this spatial interpolation technique in the early 1960s and named it kriging (French: krigeage) in honor of the pioneering work of Danie G. Krige in geostatistics.
Given a random field $\lbrace Z(s): s\in\mathcal{D}\rbrace$ observed at locations $s_1,\cdots,s_n,$ the goal of kriging is to predict $Z(\cdot)$ at an arbitrary location $s\in\mathcal{D}.$
Simple Kriging
In simple kriging, the mean of the spatial field is known; without loss of generality, we set it to zero.
Let $f:\mathcal{X}\to\mathbb{R}$ be the field of interest with $f\sim\mathcal{GP}\left(0,k(\cdot,\cdot)\right);$ we will interpolate the value of $f$ at an arbitrary location from finitely many observations. Note that $f$ is not a random field or a Gaussian process itself, since it is deterministic; rather, $f$ is a realization of the aforementioned Gaussian process. In the context of Bayesian regression, $\mathcal{GP}(0,k(\cdot,\cdot))$ is the prior distribution of $f$.
We first introduce some notation. Given locations $x_1,\cdots,x_N\in\mathcal{X},$ we use $\mathbf{K}$ to denote the covariance matrix of $\left(f(x_1),\cdots,f(x_N)\right)^\top,$ i.e., $\mathbf{K} = \lbrace k(x_i,x_j)\rbrace_{i,j=1}^N\in\mathbb{R}^{N\times N}$. We define $\mathbf{k}(\cdot): \mathcal{X}\to\mathbb{R}^N$ to be such that
$$\mathbf{k}(x) = \left(k(x,x_1),\cdots,k(x,x_N)\right)^\top.$$
Case 1. Noise-free observations
We suppose that our observations are exact, that is, the observation at location $x_i$ is the true value $f(x_i)$. If we have observations at $N$ distinct locations $x_1,\cdots,x_N\in\mathcal{X},$ then our dataset is $\left\lbrace\left(x_i,f(x_i)\right)\right\rbrace_{i=1}^N.$ Denoting $\mathbf{f} = \left(f(x_1),\cdots,f(x_N)\right)^\top,$ the posterior of $f$ at any location is given by the following theorem.
Theorem 1. (Posterior of $f$ conditioning on noise-free observations). Under the notation above, and assuming that $\mathbf{K}$ is nonsingular, the posterior of $f$ at an arbitrary location $x\in\mathcal{X}$ is Gaussian. The mean and variance are given by
$$\begin{aligned}\mathbb{E}\left[f(x)\left|\mathbf{f}\right.\right] &= \mathbf{k}(x)^\top\mathbf{K}^{-1} \mathbf{f},\\ \mathrm{Var}\left\lbrace f(x)\left|\mathbf{f}\right.\right\rbrace &= k(x,x) - \mathbf{k}(x)^\top\mathbf{K}^{-1}\mathbf{k}(x). \end{aligned}$$
Moreover, $\forall x,x'\in\mathcal{X}$,
$$\mathrm{Cov}\left\lbrace f(x), f(x')\left|\mathbf{f}\right.\right\rbrace = k(x,x') - \mathbf{k}(x)^\top\mathbf{K}^{-1} \mathbf{k}(x').$$
Proof. $\forall x,x'\in\mathcal{X}$, the joint distribution of $\left(f(x),f(x'),f(x_1),\cdots,f(x_N)\right)^\top$ can be derived:
$$\begin{pmatrix} f(x) \\ f(x') \\ \mathbf{f}\end{pmatrix} \sim N\left\lbrace \mathbf{0},\begin{pmatrix} k(x,x) & k(x,x') & \mathbf{k}(x)^\top \\ k(x', x) & k(x',x') & \mathbf{k}(x')^\top \\ \mathbf{k}(x) & \mathbf{k}(x') & \mathbf{K} \end{pmatrix}\right\rbrace.$$
Then, conditional on $\mathbf{f},$ the distribution of $(f(x),f(x'))^\top$ is still Gaussian:
$$\left.\begin{pmatrix} f(x) \\ f(x')\end{pmatrix}\right|\,\mathbf{f} \sim N\left\lbrace \begin{pmatrix} \mathbf{k}(x)^\top \\ \mathbf{k}(x')^\top \end{pmatrix}\mathbf{K}^{-1}\mathbf{f},\ \begin{pmatrix} k(x,x) & k(x,x') \\ k(x', x) & k(x',x') \end{pmatrix} - \begin{pmatrix} \mathbf{k}(x)^\top \\ \mathbf{k}(x')^\top \end{pmatrix}\mathbf{K}^{-1}\left(\mathbf{k}(x),\mathbf{k}(x')\right)\right\rbrace.$$
Remark.
- When $x$ is some $x_i\in\lbrace x_1,\cdots, x_N\rbrace$, the posterior of $f(x_i)$ reduces to the Dirac measure centered on $\mathbf{f}_i.$ To see this, note that
$$\mathbf{e}_i = \mathbf{K}^{-1}\mathbf{Ke}_i = \mathbf{K}^{-1}\mathbf{k}(x_i),$$
then
$$\mathbb{E}\left[f(x_i)\left|\mathbf{f}\right.\right] = \mathbf{e}_i^\top\mathbf{f} = f(x_i),\ \ \mathrm{Var}\left\lbrace f(x_i)\left|\mathbf{f}\right.\right\rbrace = k(x_i,x_i) - \mathbf{k}(x_i)^\top\mathbf{e}_i = 0.$$
- Define $\mu^\perp(x) = \mathbf{f}^\top\mathbf{K}^{-1}\mathbf{k}(x)$ and $k^\perp(x,x') = k(x,x') - \mathbf{k}(x)^\top\mathbf{K}^{-1} \mathbf{k}(x');$ then the posterior of $f$ is a Gaussian process:
$$f|\mathbf{f} \sim\mathcal{GP}\left(\mu^\perp(\cdot),\ k^\perp(\cdot,\cdot)\right).$$
To show this, one needs to verify that $k^\perp(\cdot,\cdot)$ is a valid kernel, i.e., that $k^\perp(\cdot,\cdot)$ is positive definite. The Schur complement can be used to derive the form of $k^\perp(\cdot,\cdot).$
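The formulas in Theorem 1 amount to a few lines of linear algebra. Below is a minimal NumPy sketch of the noise-free posterior, assuming a squared-exponential kernel and a toy sine function standing in for the observed realization; both are illustrative choices, not part of the text.

```python
# Minimal sketch of Theorem 1 (noise-free simple kriging).
import numpy as np

def k(a, b, length=0.5):
    """Squared-exponential kernel (an assumed choice)."""
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)

# Noise-free observations f = (f(x_1), ..., f(x_N))^T at distinct locations.
x_obs = np.array([0.1, 0.4, 0.7, 0.9])
f_obs = np.sin(2 * np.pi * x_obs)          # toy stand-in for a GP realization

K = k(x_obs[:, None], x_obs[None, :])      # K = {k(x_i, x_j)}
x_new = np.array([0.1, 0.25, 0.55, 0.8])   # prediction locations (0.1 is observed)
k_new = k(x_new[:, None], x_obs[None, :])  # row i is k(x_new[i])^T

post_mean = k_new @ np.linalg.solve(K, f_obs)              # k(x)^T K^{-1} f
post_var = k(x_new, x_new) - np.einsum(
    "ij,ji->i", k_new, np.linalg.solve(K, k_new.T))        # k(x,x) - k(x)^T K^{-1} k(x)

print(post_mean)   # matches f(0.1) at the observed location (up to floating point)
print(post_var)    # ~0 at the observed location, positive elsewhere
```

The near-zero posterior variance at $x=0.1$ is exactly the Dirac-measure behavior noted in the first remark.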
Case 2. Observations with additive noise
In real scenarios the true process $f$ is not directly accessible, due to random fluctuations or measurement error. As in a hierarchical model, assume that our observations carry additive noise:
\[y_i = f(x_i) + \varepsilon_i,\ \varepsilon_i\overset{\mathrm{i.i.d.}}{\sim} N(0,\sigma^2),\]with the noise independent of $f$; the posterior of $f$ at any location then needs to be modified.
Theorem 2. (Posterior of $f$ conditioning on observations with additive noise). Under the notations above with observations $Y=(y_1,\cdots,y_N)^\top,$ the posterior of $f$ at an arbitrary location $x\in\mathcal{X}$ is Gaussian. The mean and variance are given by
$$\begin{aligned}\mathbb{E}\left[f(x)\left|Y\right.\right] &= \mathbf{k}(x)^\top\left(\mathbf{K} + \sigma^2\mathbf{I}_{N}\right)^{-1} Y,\\ \mathrm{Var}\left\lbrace f(x)\left|Y\right.\right\rbrace &= k(x,x) - \mathbf{k}(x)^\top\left(\mathbf{K} + \sigma^2\mathbf{I}_{N}\right)^{-1}{\mathbf{k}}(x). \end{aligned}$$
Moreover, $\forall x,x'\in\mathcal{X}$,
$$\mathrm{Cov}\left\lbrace f(x), f(x')\left|Y\right.\right\rbrace = k(x,x') - \mathbf{k}(x)^\top\left(\mathbf{K} + \sigma^2\mathbf{I}_{N}\right)^{-1}{\mathbf{k}}(x').$$
Proof. Analogous to Theorem 1, use the joint distribution of $\left(f(x),f(x'),Y\right),$ noting that $\mathrm{Cov}(Y) = \mathbf{K} + \sigma^2\mathbf{I}_N$ and $\mathrm{Cov}\left(f(x),Y\right) = \mathbf{k}(x)^\top,$ to derive the conditional (posterior) distribution.
Remark.
- The posterior of $f$ is a Gaussian process with the mean and covariance function given by Theorem 2. A natural point estimator is the posterior mean:
$$\hat{f}(\cdot) = Y^\top(\mathbf{K}+\sigma^2\mathbf{I}_N)^{-1}\mathbf{k}(\cdot),$$
which is a linear combination of $\lbrace k(\cdot,x_i)\rbrace_{i=1}^N.$
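Theorem 2 changes only one matrix relative to Theorem 1: $\mathbf{K}$ is replaced by $\mathbf{K}+\sigma^2\mathbf{I}_N$ and $\mathbf{f}$ by $Y$. A minimal sketch, again with an assumed kernel, noise level, and toy data:

```python
# Minimal sketch of Theorem 2 (observations with additive noise).
import numpy as np

def k(a, b, length=0.5):
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)   # assumed kernel

rng = np.random.default_rng(0)
x_obs = np.array([0.1, 0.4, 0.7, 0.9])
sigma2 = 0.01                                          # assumed noise variance
y = np.sin(2 * np.pi * x_obs) + rng.normal(0.0, np.sqrt(sigma2), x_obs.size)

Kn = k(x_obs[:, None], x_obs[None, :]) + sigma2 * np.eye(x_obs.size)   # K + sigma^2 I_N

x_new = np.array([0.1, 0.25, 0.55, 0.8])
k_new = k(x_new[:, None], x_obs[None, :])

post_mean = k_new @ np.linalg.solve(Kn, y)                 # k(x)^T (K + sigma^2 I)^{-1} Y
post_cov = (k(x_new[:, None], x_new[None, :])
            - k_new @ np.linalg.solve(Kn, k_new.T))        # posterior covariance
print(post_mean)
print(np.diag(post_cov))   # no longer zero at the observed location
```

In practice one would factor $\mathbf{K}+\sigma^2\mathbf{I}_N$ once (e.g. via a Cholesky decomposition) and reuse the factorization for both the mean and the covariance.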
Case 3. Overlapping observations
In the previous discussion, the locations $x_1,\cdots,x_N$ were assumed to be distinct; in other words, there is exactly one observation at each location.
Now we suppose that we have $N$ observations at $n$ unique locations $\bar{x}_ 1,\cdots,\bar{x}_ n$ with $n < N.$ At location $\bar{x}_ j$, we have $a_ j$ observations $y_ {j,1},\cdots,y_ {j,a_ j},$ with $j=1,\cdots,n,\ a_1+\cdots+a_n = N$ and $\max_j a_j > 1$.
We denote
$$(x_1,\cdots, x_N) = (\underbrace{\bar{x}_1,\cdots,\bar{x}_1}_{a_1},\ \underbrace{\bar{x}_2,\cdots,\bar{x}_2}_{a_2},\cdots,\underbrace{\bar{x}_n,\cdots,\bar{x}_n}_{a_n}) \in\mathcal{X}^N,$$
$$Y = (y_{1,1},\cdots,y_{1,a_1},y_{2,1},\cdots,y_{2,a_2},\cdots,y_{n,1},\cdots,y_{n,a_n})^\top\in\mathbb{R}^N,$$
$$\bar{Y} = (\bar{y}_1,\cdots,\bar{y}_n)^\top\in\mathbb{R}^n,\ \ \bar{y}_j = \frac{1}{a_j}\sum_{l=1}^{a_j}y_{j,l}.$$
Moreover, denote
$$\bar{\mathbf{k}}(x) = \left(k(x,\bar{x}_1),\cdots,k(x,\bar{x}_n)\right)^\top\in\mathbb{R}^n,\ \ \bar{\mathbf{K}} = \lbrace k(\bar{x}_i,\bar{x}_j)\rbrace_{i,j=1}^n\in\mathbb{R}^{n\times n}.$$
Let $\mathbf{U}\in\mathbb{R}^{N\times n}$ be a block diagonal matrix such that $\mathbf{U}=\mathrm{diag}\lbrace \mathbf{1}_ {a_ 1},\cdots,\mathbf{1}_ {a_ n}\rbrace$, where $\mathbf{1}_ {a_j}$ is an $a_j$-dimensional all-one vector, and let $\mathbf{A} = \mathrm{diag}\lbrace a_1,\cdots,a_n\rbrace.$
Proposition 1. With the notations above, we have $\bar{Y} = \mathbf{A}^{-1}\mathbf{U}^\top Y,\ \bar{\mathbf{k}}(x) = \mathbf{A}^{-1}\mathbf{U}^\top\mathbf{k}(x),\ \mathbf{k}(x) = \mathbf{U}\bar{\mathbf{k}}(x),\ \mathbf{K} = \mathbf{U}\bar{\mathbf{K}}\mathbf{U}^\top.$
These equalities can be verified by direct calculation (or numerically, as in the sketch below).
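For concreteness, here is a small numerical check of Proposition 1 with assumed group sizes and an assumed kernel; nothing below is specific to the text beyond the definitions of $\mathbf{U}$, $\mathbf{A}$, $\bar{\mathbf{k}}$ and $\bar{\mathbf{K}}$.

```python
# Numerical check of the identities in Proposition 1.
import numpy as np

def k(a, b, length=0.5):
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)   # assumed kernel

a = np.array([2, 3, 1])                    # assumed group sizes a_1, ..., a_n (N = 6)
x_bar = np.array([0.2, 0.5, 0.8])          # unique locations
x_all = np.repeat(x_bar, a)                # x_1, ..., x_N with repeats

U = np.zeros((a.sum(), a.size))            # block-diagonal matrix of all-one vectors
U[np.arange(a.sum()), np.repeat(np.arange(a.size), a)] = 1.0
A = np.diag(a.astype(float))

Y = np.random.default_rng(1).normal(size=a.sum())   # arbitrary observations
Y_bar = U.T @ Y / a                                  # within-location averages

x = 0.33                                   # an arbitrary location
k_vec, k_bar = k(x, x_all), k(x, x_bar)
K = k(x_all[:, None], x_all[None, :])
K_bar = k(x_bar[:, None], x_bar[None, :])

assert np.allclose(Y_bar, np.linalg.inv(A) @ U.T @ Y)
assert np.allclose(k_bar, np.linalg.inv(A) @ U.T @ k_vec)
assert np.allclose(k_vec, U @ k_bar)
assert np.allclose(K, U @ K_bar @ U.T)
print("Proposition 1 identities hold numerically.")
```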
Proposition 2. Suppose $\sigma^2 > 0,$ then $\mathbf{U}^\top\left(\sigma^2\mathbf{I}_N + \mathbf{K}\right)^{-1}\mathbf{U} = \left(\sigma^2\mathbf{A}^{-1} + \bar{\mathbf{K}}\right)^{-1}.$
Proof. Recall the Sherman-Morrison-Woodbury formula, stated here for generic conformable matrices $\mathbf{P},\mathbf{U},\mathbf{C},\mathbf{V}$ (placeholders, not the matrices defined above):
\[(\mathbf{P} + \mathbf{U}\mathbf{C}\mathbf{V})^{-1} = \mathbf{P}^{-1} - \mathbf{P}^{-1}\mathbf{U}\left(\mathbf{C}^{-1} + \mathbf{V}\mathbf{P}^{-1}\mathbf{U}\right)^{-1}\mathbf{V}\mathbf{P}^{-1}.\]We have
$$\left(\sigma^2\mathbf{I}_N + \mathbf{K}\right)^{-1} = \left(\sigma^2\mathbf{I}_N + \mathbf{U}\bar{\mathbf{K}}\mathbf{U}^\top\right)^{-1} = \sigma^{-2}\mathbf{I}_N - \sigma^{-4}\mathbf{U}\left(\bar{\mathbf{K}}^{-1} + \sigma^{-2}\mathbf{U}^\top\mathbf{U}\right)^{-1}\mathbf{U}^\top;$$
Note that $\mathbf{U}^\top\mathbf{U} = \mathbf{A}$, then
$$\mathbf{U}^\top\left(\sigma^2\mathbf{I}_N + \mathbf{K}\right)^{-1}\mathbf{U} = \sigma^{-2}\mathbf{A} - \sigma^{-4}\mathbf{A}\left(\bar{\mathbf{K}}^{-1} + \sigma^{-2}\mathbf{A}\right)^{-1}\mathbf{A}.$$
Applying the Sherman-Morrison-Woodbury formula once more, this time to $\left(\sigma^2\mathbf{A}^{-1} + \bar{\mathbf{K}}\right)^{-1},$ we have
$$\sigma^{-2}\mathbf{A} - \sigma^{-4}\mathbf{A}\left(\bar{\mathbf{K}}^{-1} + \sigma^{-2}\mathbf{A}\right)^{-1}\mathbf{A} = \left(\sigma^2\mathbf{A}^{-1} + \bar{\mathbf{K}}\right)^{-1},$$
which concludes the proof.
Theorem 3. (Posterior of $f$ conditioning on overlapping observations). Under the notations above, the posterior of $f$ at an arbitrary location $x\in\mathcal{X}$ is Gaussian and depends on the data only through the within-location averages $\bar{Y}.$ The mean and variance are given by
$$\begin{aligned}\mathbb{E}\left[f(x)\left|\bar{Y}\right.\right] &= \bar{\mathbf{k}}(x)^\top\left(\bar{\mathbf{K}} + \sigma^2\mathbf{A}^{-1}\right)^{-1}\bar{Y},\\ \mathrm{Var}\left\lbrace f(x)\left|\bar{Y}\right.\right\rbrace &= k(x,x) - \bar{\mathbf{k}}(x)^\top\left(\bar{\mathbf{K}} + \sigma^2\mathbf{A}^{-1}\right)^{-1}\bar{\mathbf{k}}(x). \end{aligned}$$
Moreover, $\forall x,x'\in\mathcal{X}$,
$$\mathrm{Cov}\left\lbrace f(x), f(x')\left|\bar{Y}\right.\right\rbrace = k(x,x') - \bar{\mathbf{k}}(x)^\top\left(\bar{\mathbf{K}} + \sigma^2\mathbf{A}^{-1}\right)^{-1}\bar{\mathbf{k}}(x').$$
Proof. Plug Propositions 1 and 2 into the posterior of $f$ given in Theorem 2.
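As a sanity check on Theorem 3 (and, implicitly, on Proposition 2), the reduced formulas built from $\bar{Y}$, $\bar{\mathbf{K}}$ and $\mathbf{A}$ should reproduce the Theorem 2 posterior computed from the full data. A sketch under assumed data and kernel:

```python
# Check that Theorem 3 (averaged data) matches Theorem 2 (full data).
import numpy as np

def k(a, b, length=0.5):
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)   # assumed kernel

rng = np.random.default_rng(2)
a = np.array([2, 3, 1])                                # observations per unique location
x_bar = np.array([0.2, 0.5, 0.8])
x_all = np.repeat(x_bar, a)                            # N = 6 locations with repeats
sigma2 = 0.05                                          # assumed noise variance
Y = np.sin(2 * np.pi * x_all) + rng.normal(0, np.sqrt(sigma2), x_all.size)
Y_bar = np.array([Y[s:s + m].mean() for s, m in zip(np.cumsum(a) - a, a)])

K = k(x_all[:, None], x_all[None, :])
K_bar = k(x_bar[:, None], x_bar[None, :])
A_inv = np.diag(1.0 / a)

x = 0.33                                               # an arbitrary location
k_full, k_bar = k(x, x_all), k(x, x_bar)

# Theorem 2 (full data) vs Theorem 3 (averaged data).
mean_full = k_full @ np.linalg.solve(K + sigma2 * np.eye(x_all.size), Y)
mean_bar = k_bar @ np.linalg.solve(K_bar + sigma2 * A_inv, Y_bar)
var_full = k(x, x) - k_full @ np.linalg.solve(K + sigma2 * np.eye(x_all.size), k_full)
var_bar = k(x, x) - k_bar @ np.linalg.solve(K_bar + sigma2 * A_inv, k_bar)
assert np.allclose(mean_full, mean_bar) and np.allclose(var_full, var_bar)
print(mean_full, var_full)
```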
Universal Kriging
In universal kriging, the mean of the spatial field is unknown. To apply regression, we choose basis functions $\lbrace m_1(\cdot),\cdots,m_p(\cdot)\rbrace$ from $L_2(\mathcal{X})$ and denote $\mathbf{m}(\cdot) = \left(m_1(\cdot),\cdots,m_p(\cdot)\right)^\top:\mathcal{X}\to\mathbb{R}^p.$
Assume that $f\sim\mathcal{GP}\left(\mathbf{m}(\cdot)^\top\beta,\ k(\cdot,\cdot)\right),$ where $\beta\in\mathbb{R}^p$ is unknown, and that, given $f,$ the observations $\left\lbrace Y(x): x\in\mathcal{X}\right\rbrace$ are independent with $\left.Y(x)\right\vert f\sim N(f(x),\sigma^2).$ Then, we can define the error process
$$\epsilon(x) := Y(x) - \mathbf{m}(x)^\top\beta,$$
which implies the following joint distribution of the errors at $x_1,\cdots,x_n$:
$$\epsilon(x_{1:n})\sim N(0,\Sigma),\ \ \Sigma = \mathbf{K} + \sigma^2\mathbf{I}_n,\ \ \mathbf{K} = \lbrace k(x_i,x_j)\rbrace_{i,j=1}^n.$$
Furthermore, the joint distribution of our observations can be induced:
$$\mathbf{Y} := \left(Y(x_1),\cdots,Y(x_n)\right)^\top\sim N\left(\mathbf{M}\beta,\Sigma\right),\ \ \mathbf{M} = \left(\mathbf{m}(x_1),\cdots,\mathbf{m}(x_n)\right)^\top\in\mathbb{R}^{n\times p}.$$
Now we derive the best linear unbiased prediction (BLUP) of $Y(x)$ for an arbitrary $x\in\mathcal{X}.$ Generally, we have the following result.
Theorem 4 (Universal kriging). Suppose the covariance matrix of $\mathbf{Y}=\left(Y(x_1),\cdots,Y(x_n)\right)^\top$ is nonsingular, and $\mathbf{M}=\left(\mathbf{m}(x_1),\cdots,\mathbf{m}(x_n)\right)^\top$ is of full column rank. Then for $x\in\mathcal{X},$ the BLUP of $Y(x)$ is given by
$$\hat{Y}(x) = \mathbf{m}(x)^\top\hat{\beta} + \mathbf{k}(x)^\top\Sigma^{-1}\left(\mathbf{Y} - \mathbf{M}\hat{\beta}\right),$$
where $\hat{\beta} = \left(\mathbf{M}^\top\Sigma^{-1}\mathbf{M}\right)^{-1}\mathbf{M}^\top\Sigma^{-1}\mathbf{Y}.$
Proof. The derivation of the BLUP is straightforward but tedious. We first express the prediction of $Y(x)$ as an affine function of the observations:
$$\hat{Y}(x) = \lambda_0(x) + \sum_{i=1}^n \lambda_i(x) Y(x_i) = \lambda_0(x) + \lambda(x)^\top\mathbf{Y}.$$
The unbiasedness of BLUP implies that
$$\mathbf{m}(x)^\top\beta = \mathbb{E}\left[\hat{Y}(x)\right] = \lambda_0(x) + \lambda(x)^\top\mathbf{M}\beta\ \ \ \forall\beta\in\mathbb{R}^p;$$
setting $\beta=0$ gives $\lambda_0(x)=0;$ and since the identity must hold for every $\beta\in\mathbb{R}^p,$ it also forces the constraint $\mathbf{M}^\top\lambda(x) = \mathbf{m}(x).$
To find the BLUP, which attains the minimum mean squared error among all unbiased predictors of the form above, one needs to solve the following optimization problem:
$$\min_{\lambda(x)\in\mathbb{R}^n}\ \ \mathbb{E}\left[\left(Y(x) - \lambda(x)^\top\mathbf{Y}\right)^2\right]\ \ \ \textrm{subject to}\ \ \ \mathbf{M}^\top\lambda(x) = \mathbf{m}(x).$$
Under the unbiasedness constraint, and dropping the additive constant coming from the noise at $x$ (which does not involve $\lambda(x)$), the mean squared error of the predictor is
$$\mathbb{E}\left[\left(Y(x) - \lambda(x)^\top\mathbf{Y}\right)^2\right] = k(x,x) - 2\mathbf{k}(x)^\top\lambda(x) + \lambda(x)^\top\Sigma\lambda(x),$$
then we can construct the Lagrangian function as:
$$L(\lambda(x),\nu(x)) = \lambda(x)^\top\Sigma\lambda(x) - 2\mathbf{k}(x)^\top\lambda(x) + 2\nu(x)^\top\left(\mathbf{m}(x) - \mathbf{M}^\top\lambda(x)\right).$$
Applying the Karush-Kuhn-Tucker conditions, the optimal primal and dual solutions satisfy:
$$\begin{cases} \frac{\partial L}{\partial\lambda^*(x)} = 2\Sigma\lambda^*(x) - 2\mathbf{k}(x) - 2\mathbf{M}\nu^*(x) = 0\\ \frac{\partial L}{\partial\nu^*(x)} = \mathbf{M}^\top\lambda^*(x) - \mathbf{m}(x) = 0, \end{cases}$$
and solving this system gives
$$\begin{cases} \nu^*(x) = \left(\mathbf{M}^\top\Sigma^{-1}\mathbf{M}\right)^{-1}\left(\mathbf{m}(x) - \mathbf{M}^\top\Sigma^{-1}\mathbf{k}(x)\right)\\ \lambda^*(x) = \Sigma^{-1}\mathbf{k}(x) + \Sigma^{-1}\mathbf{M}\left(\mathbf{M}^\top\Sigma^{-1}\mathbf{M}\right)^{-1}\left(\mathbf{m}(x) - \mathbf{M}^\top\Sigma^{-1}\mathbf{k}(x)\right). \end{cases}$$
Let $\hat{\beta} = \left(\mathbf{M}^\top\Sigma^{-1}\mathbf{M}\right)^{-1}\mathbf{M}^\top\Sigma^{-1}\mathbf{Y},$ then
$$\hat{Y}(x) = \lambda^*(x)^\top\mathbf{Y} = \mathbf{m}(x)^\top\hat{\beta} + \mathbf{k}(x)^\top\Sigma^{-1}\left(\mathbf{Y} - \mathbf{M}\hat{\beta}\right).$$
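A minimal sketch of Theorem 4, assuming a squared-exponential kernel and a linear trend basis $\mathbf{m}(x)=(1,x)^\top$; both are illustrative choices rather than anything prescribed above.

```python
# Minimal sketch of universal kriging (Theorem 4).
import numpy as np

def k(a, b, length=0.5):
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)   # assumed kernel

def m(x):
    """Assumed trend basis m(x) = (1, x)^T, so p = 2."""
    x = np.atleast_1d(x)
    return np.stack([np.ones_like(x), x], axis=-1)

rng = np.random.default_rng(3)
x_obs = np.linspace(0.0, 1.0, 8)
sigma2 = 0.01
beta_true = np.array([1.0, 2.0])                       # assumed "true" trend coefficients
Y = (m(x_obs) @ beta_true + 0.3 * np.sin(4 * np.pi * x_obs)
     + rng.normal(0, np.sqrt(sigma2), x_obs.size))

M = m(x_obs)                                           # n x p design matrix
Sigma = k(x_obs[:, None], x_obs[None, :]) + sigma2 * np.eye(x_obs.size)

# Generalized least squares estimate of beta.
Si_M = np.linalg.solve(Sigma, M)                       # Sigma^{-1} M
beta_hat = np.linalg.solve(M.T @ Si_M, Si_M.T @ Y)     # (M^T S^{-1} M)^{-1} M^T S^{-1} Y

# BLUP at new locations.
x_new = np.array([0.15, 0.55, 0.95])
k_new = k(x_new[:, None], x_obs[None, :])              # rows are k(x)^T
Y_hat = m(x_new) @ beta_hat + k_new @ np.linalg.solve(Sigma, Y - M @ beta_hat)
print(beta_hat)
print(Y_hat)
```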
Properties.
- $\mathbb{E}\hat{\beta} = \beta,\ \mathrm{Cov}(\hat{\beta}) = \left(\mathbf{M}^\top\Sigma^{-1}\mathbf{M}\right)^{-1}.$
- For any $x,x'\in\mathcal{X},$ $\mathbb{E}\hat{Y}(x) = \mathbf{m}(x)^\top\beta$ (unbiasedness), and the covariance of the kriging errors is
$$\mathrm{Cov}\left\lbrace Y(x) - \hat{Y}(x),\ Y(x') - \hat{Y}(x')\right\rbrace = k(x,x') - \mathbf{k}(x)^\top\Sigma^{-1}\mathbf{k}(x') + \left(\mathbf{m}(x) - \mathbf{M}^\top\Sigma^{-1}\mathbf{k}(x)\right)^\top\left(\mathbf{M}^\top\Sigma^{-1}\mathbf{M}\right)^{-1}\left(\mathbf{m}(x') - \mathbf{M}^\top\Sigma^{-1}\mathbf{k}(x')\right),$$
where, as above, the noise at the prediction locations is omitted. The last term is the extra uncertainty due to estimating $\beta$; when $\beta$ is known it disappears and the simple-kriging covariance of Theorem 2 is recovered.
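The first property can be checked by a quick Monte Carlo experiment: drawing $\mathbf{Y}\sim N(\mathbf{M}\beta,\Sigma)$ repeatedly, the sample mean and covariance of $\hat{\beta}$ should approach $\beta$ and $\left(\mathbf{M}^\top\Sigma^{-1}\mathbf{M}\right)^{-1}.$ The kernel, basis and $\beta$ below are assumed for illustration.

```python
# Monte Carlo check of E[beta_hat] = beta and Cov(beta_hat) = (M^T Sigma^{-1} M)^{-1}.
import numpy as np

def k(a, b, length=0.5):
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)   # assumed kernel

rng = np.random.default_rng(5)
x_obs = np.linspace(0.0, 1.0, 8)
M = np.stack([np.ones_like(x_obs), x_obs], axis=-1)    # assumed basis m(x) = (1, x)^T
Sigma = k(x_obs[:, None], x_obs[None, :]) + 0.01 * np.eye(x_obs.size)
beta = np.array([1.0, 2.0])                            # assumed "true" beta

G_inv = np.linalg.inv(M.T @ np.linalg.solve(Sigma, M)) # (M^T Sigma^{-1} M)^{-1}
L = np.linalg.cholesky(Sigma)

# Draw many datasets Y ~ N(M beta, Sigma) and form the GLS estimate for each.
n_rep = 20000
Ys = (M @ beta)[:, None] + L @ rng.normal(size=(x_obs.size, n_rep))
beta_hats = G_inv @ (M.T @ np.linalg.solve(Sigma, Ys))  # shape (p, n_rep)

print(beta_hats.mean(axis=1))   # close to beta
print(np.cov(beta_hats))        # close to G_inv
print(G_inv)
```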
Ordinary Kriging
In ordinary kriging, the spatial field is assumed to have a constant but unknown mean.
Ordinary kriging can be handled as a special case of universal kriging in which the basis is the single constant function $\mathbf{m}(\cdot)\equiv 1$ (i.e., $p=1$). Specializing the universal kriging formulas, the BLUP of the process $Y(\cdot)$ at an arbitrary location $x\in\mathcal{X}$ is given by
$$\hat{Y}(x) = \hat{\beta} + \mathbf{k}(x)^\top\Sigma^{-1}(\mathbf{Y} - \hat{\beta}\mathbf{1}_n),\ \ \text{where}\ \ \hat{\beta} = \frac{\mathbf{1}_n^\top \Sigma^{-1}\mathbf{Y}}{\mathbf{1}_n^\top \Sigma^{-1}\mathbf{1}_n}.$$
Here $\mathbf{1}_n$ is the $n$-dimensional all-one vector.
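A short sketch of ordinary kriging following the formula above, with an assumed kernel and toy data:

```python
# Minimal sketch of ordinary kriging (constant but unknown mean).
import numpy as np

def k(a, b, length=0.5):
    return np.exp(-0.5 * (a - b) ** 2 / length ** 2)   # assumed kernel

rng = np.random.default_rng(4)
x_obs = np.linspace(0.0, 1.0, 8)
sigma2 = 0.01
Y = 2.0 + 0.3 * np.sin(4 * np.pi * x_obs) + rng.normal(0, np.sqrt(sigma2), x_obs.size)

Sigma = k(x_obs[:, None], x_obs[None, :]) + sigma2 * np.eye(x_obs.size)
ones = np.ones(x_obs.size)

# beta_hat = (1^T Sigma^{-1} Y) / (1^T Sigma^{-1} 1)
beta_hat = (ones @ np.linalg.solve(Sigma, Y)) / (ones @ np.linalg.solve(Sigma, ones))

x_new = np.array([0.15, 0.55, 0.95])
k_new = k(x_new[:, None], x_obs[None, :])
Y_hat = beta_hat + k_new @ np.linalg.solve(Sigma, Y - beta_hat * ones)
print(beta_hat)   # estimated constant mean
print(Y_hat)      # BLUP at the new locations
```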