Nonparametric Regression

Penalized least squares

Suppose that we have a dataset $\lbrace (x_i,y_i)\rbrace_{i=1}^n$ which contains $n$ observations of a covariate $x\in\mathcal{X}$ and a response $y\in\mathbb{R}.$ A nonparametric regression model assumes that

\[y_i = f(x_i) + \varepsilon_i,\ i=1,\cdots,n,\tag{*}\]

where $\lbrace\varepsilon_i\rbrace_{i=1}^n$ are iid error terms with mean zero.

Assume that $\mathcal{H}$ is an RKHS whose reproducing kernel $k(\cdot,\cdot):\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is symmetric and positive definite, and that $f\in\mathcal{H}.$ We estimate the optimal $f^*$ as a solution to the penalized least squares problem:

$$f^* \in \underset{f\in\mathcal{H}}{\mathrm{argmin}}\ \frac{1}{n}\sum_{i=1}^n\left(y_i - f(x_i)\right)^2 + J(f),$$

where $J:\mathcal{H}\to\mathbb{R}$ is some differentiable regularization function. The first term (the mean square error) measures the goodness-of-fit, and the second penalizes the complexity of the fitted function. Since $\mathcal{H}$ is usually infinite dimensional, we can always find some $f$ that interpolates every data point $(x_i,y_i)$ exactly; however, such an $f$ typically generalizes poorly, which is why we add a penalty term to the objective function, just as in ridge regression.
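
For instance, in the familiar parametric special case (an illustration, not part of the general development) where $\mathcal{X}=\mathbb{R}^p,$ $f(x)=x^\top\beta,$ and $J(f)=\lambda\Vert\beta\Vert_2^2$ with $\lambda>0,$ the problem reduces to ordinary ridge regression:

$$\min_{\beta\in\mathbb{R}^p}\ \frac{1}{n}\sum_{i=1}^n\left(y_i - x_i^\top\beta\right)^2 + \lambda\Vert\beta\Vert_2^2.$$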

Generalized loss

We continue with the nonparametric regression model given by $(*)$. Suppose that $f$ is realizable in an RKHS $\mathcal{H}$ with reproducing kernel $k(\cdot,\cdot)$. In the RKHS framework, let the regularization term be $R(f)=g(\Vert f\Vert_\mathcal{H}),$ where $g:[0,\infty)\to\mathbb{R}$ is some strictly monotonically increasing function.

In penalized least squares, we used the mean square error as the criterion to evaluate goodness-of-fit. This can be relaxed by allowing an arbitrary loss function $L:(\mathcal{X}\times\mathbb{R}^2)^n \to \mathbb{R}\cup\lbrace\infty\rbrace.$ We then define the optimal $f^*$ as a minimizer of the regularized empirical risk functional:

$$f^* \in \underset{f\in\mathcal{H}}{\mathrm{argmin}}\ L\left\lbrace\left(x_1,y_1,f(x_1)\right),\cdots,\left(x_n,y_n,f(x_n)\right)\right\rbrace + g(\Vert f\Vert_\mathcal{H}).$$

Here $L$ can be the mean square error:

$$L\left\lbrace\left(x_1,y_1,f(x_1)\right),\cdots,\left(x_n,y_n,f(x_n)\right)\right\rbrace = \frac{1}{n}\sum_{i=1}^n\lbrace y_i - f(x_i)\rbrace^2.$$

Another example of the loss function is the hinge loss:

$$L\left\lbrace\left(x_1,y_1,f(x_1)\right),\cdots,\left(x_n,y_n,f(x_n)\right)\right\rbrace = \frac{1}{n}\sum_{i=1}^n\max\lbrace 1-y_if(x_i), 0\rbrace,$$

which is commonly used in support vector machines (SVMs), where the responses are binary labels $y_i\in\lbrace -1,+1\rbrace.$
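
To fix ideas, here is a minimal NumPy sketch (an illustration, not part of the original development) that evaluates both example losses at hypothetical fitted values; the arrays `y_reg`, `y_cls`, and `f_hat` are placeholders.

```python
import numpy as np

def mse_loss(y, f_vals):
    # Mean square error between responses y and fitted values f(x_i).
    return np.mean((y - f_vals) ** 2)

def hinge_loss(y, f_vals):
    # Average hinge loss; assumes the responses are coded as +1 / -1 labels.
    return np.mean(np.maximum(1.0 - y * f_vals, 0.0))

# Hypothetical data for illustration only.
y_reg = np.array([0.5, -1.2, 2.0])   # continuous responses for the MSE
y_cls = np.array([1.0, -1.0, 1.0])   # binary labels for the hinge loss
f_hat = np.array([0.4, -0.9, 1.5])   # fitted values f(x_i)

print(mse_loss(y_reg, f_hat))    # goodness-of-fit under squared error
print(hinge_loss(y_cls, f_hat))  # goodness-of-fit under the hinge loss
```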

Recall Mercer's theorem: if $k(\cdot,\cdot)$ has the expansion $k(x,y)=\sum_{l\in\mathbb{N}^*}\lambda_l\phi_l(x)\phi_l(y),$ then any $f\in\mathcal{H}$ has a representation

$$f(\cdot)=\sum_{l=1}^\infty f_l\phi_l(\cdot).$$

Using $\Vert f\Vert_\mathcal{H}^2=\sum_{l=1}^\infty f_l^2/\lambda_l,$ we can then rewrite the minimization problem as

$$\underset{\lbrace f_l\rbrace_{l\geq 1}}{\min}\ L\left\lbrace\left(x_1,y_1,\sum_{l=1}^\infty f_l\phi_l(x_1)\right),\cdots,\left(x_n,y_n,\sum_{l=1}^\infty f_l\phi_l(x_n)\right)\right\rbrace + g\left(\sqrt{\sum_{l=1}^\infty \frac{f_l^2}{\lambda_l}}\right).$$

As we can see, the optimization problem is infinite dimensional, which makes it difficult to handle directly. The representer theorem, stated below, resolves this difficulty.

Representer Theorem

Theorem (Schölkopf-Herbrich-Smola). Suppose we are given a nonempty set $\mathcal{X}$, an RKHS $\mathcal{H}$ with reproducing kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R},$ a finite training dataset $\lbrace(x_i, y_i)\rbrace_{i=1}^n\subset\mathcal{X}\times\mathbb{R},$ a strictly monotonically increasing function $g:[0,\infty)\to\mathbb{R}$ and an arbitrary loss function $L:(\mathcal{X}\times\mathbb{R}^2)^n \to \mathbb{R}\cup\lbrace\infty\rbrace.$ Then any minimizer $f^*\in\mathcal{H}$ of the regularized empirical risk functional

$$\mathcal{R}[f] := L\left\lbrace\left(x_1,y_1,f(x_1)\right),\cdots,\left(x_n,y_n,f(x_n)\right)\right\rbrace + g(\Vert f\Vert_\mathcal{H})$$

admits a representation of the form

$$f^*(\cdot) = \sum_{i=1}^n\alpha_i k(\cdot, x_i),\ \ \alpha_1,\cdots,\alpha_n\in\mathbb{R}.$$

Proof. Define the subspace of $\mathcal{H}$:

$$\mathcal{H}_1 = \mathrm{span}\left\lbrace k(\cdot,x_i): i=1,\cdots,n\right\rbrace := \left\lbrace \sum_{i=1}^n\alpha_i k(\cdot,x_i): \alpha_1,\cdots,\alpha_n\in\mathbb{R}\right\rbrace,$$

and let $\mathcal{H}_2=\mathcal{H}_1^\perp$ be its orthogonal complement, so that $\mathcal{H}=\mathcal{H}_1\oplus\mathcal{H}_2.$ Then any $f\in\mathcal{H}$ can be decomposed into two orthogonal components:

$$f=\hat{f} + \delta,\ \hat{f}\in\mathcal{H}_1,\ \delta\in\mathcal{H}_2.$$

Since $\delta$ is orthogonal to $\mathcal{H}_1,$ the reproducing property of $\mathcal{H}$ yields

\[\delta(x_i) = \langle\delta(\cdot), k(\cdot,x_i)\rangle_\mathcal{H} = 0,\ \ i=1,\cdots,n.\]

Hence $f(x_i)=\hat{f}(x_i)$ for $i=1,\cdots,n,$ so the first term of $\mathcal{R}$ takes the same value at $f$ and $\hat{f}.$

For the second term, orthogonality and the monotonicity of $g$ imply that

$$g(\Vert f\Vert_\mathcal{H}) = g\left(\sqrt{\Vert \hat{f}\Vert_\mathcal{H}^2 + \Vert\delta\Vert_\mathcal{H}^2}\right) \geq g\left(\Vert\hat{f}\Vert_\mathcal{H}\right).$$

Hence $\mathcal{R}[f]\geq\mathcal{R}[\hat{f}],$ and since $g$ is strictly increasing, equality holds if and only if $\delta = 0.$

Now suppose $f^{*}$ is a minimizer of $\mathcal{R},$ with decomposition $f^{*} = \hat{f^{*}} + \delta^{*}.$ Then we simultaneously have $\mathcal{R}[f^*]\leq \mathcal{R}[\hat{f^{*}}]$ (because $f^{*}$ is a minimizer) and $\mathcal{R}[f^*]\geq \mathcal{R}[\hat{f^{*}}]$ (by the argument above). Hence $\mathcal{R}[f^*] = \mathcal{R}[\hat{f^{*}}],$ so $\delta^{*}=0$ and $f^{*}=\hat{f^{*}}\in\mathcal{H}_1,$ which concludes the proof.

Remark. The representer theorem identifies a finite dimensional subspace $\mathcal{H}_1$ of the (typically infinite dimensional) space $\mathcal{H}$ that contains a minimizer. The solution can therefore be represented as a finite linear combination of kernel functions $k(\cdot,x_i)$ centered at the training points.
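
To make the remark concrete, here is a hedged Python sketch of the finite-dimensional reduction: rather than searching over all of $\mathcal{H},$ we optimize the $n$ coefficients $\alpha_1,\cdots,\alpha_n$ directly. The RBF kernel, the squared error loss, the choice $g(t)=\lambda t^2,$ and the toy data are illustrative assumptions, not prescribed by the theorem.

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(a, b, length_scale=1.0):
    # Gaussian (RBF) kernel matrix between 1-d arrays a and b.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

def regularized_risk(alpha, K, y, lam=0.1):
    f_vals = K @ alpha               # f(x_i) = sum_j alpha_j k(x_i, x_j)
    norm_sq = alpha @ K @ alpha      # ||f||_H^2 for f in span{k(., x_i)}
    return np.mean((y - f_vals) ** 2) + lam * norm_sq   # g(t) = lam * t^2

# Toy data (placeholders).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
K = rbf_kernel(x, x)

res = minimize(regularized_risk, x0=np.zeros(20), args=(K, y))
alpha_hat = res.x   # coefficients of f*(.) = sum_i alpha_i k(., x_i)
```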

Equivalence between KRR and Kriging

Both kernel ridge regression (KRR) and kriging are nonparametric regression methods widely used for spatial interpolation. KRR finds an optimal solution by minimizing a penalized loss function, analogous to other frequentist methods such as maximum likelihood estimation. Kriging, as a Bayesian approach, places a Gaussian process prior on the spatial field of interest and then derives the posterior.

In this part we reveal an intrinsic connection between the two (frequentist and Bayesian) approaches.

Setting of kriging. Given $n$ observations $\lbrace(x_i,y_i)\rbrace_{i=1}^n\subset\mathcal{X}\times\mathbb{R},$ consider a nonparametric regression model

\[y_i = f(x_i) + \varepsilon_i,\ \varepsilon_i\overset{\mathrm{i.i.d.}}{\sim}N(0,\sigma^2)\]

with prior $f\sim\mathcal{GP}\left(0,k(\cdot,\cdot)\right).$ The posterior of $f$ is again a Gaussian process, with

$$\begin{aligned} \mathbb{E}[f(x)|\mathbf{Y}] &= \mathbf{k}(x)^\top(\mathbf{K} + \sigma^2\mathbf{I}_n)^{-1}\mathbf{Y},\\ \mathrm{Cov}\lbrace f(x),f(x')|\mathbf{Y}\rbrace &= k(x,x') - \mathbf{k}(x)^\top(\mathbf{K} + \sigma^2\mathbf{I}_n)^{-1}\mathbf{k}(x'), \end{aligned}$$

where $\mathbf{k}(x)=\left(k(x,x_1),\cdots,k(x,x_n)\right)^\top,\ \mathbf{K}=\lbrace k(x_i,x_j)\rbrace_{i,j=1}^n,\ \mathbf{Y}=(y_1,\cdots,y_n)^\top.$
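
For completeness, a minimal NumPy sketch of these posterior formulas, assuming an RBF kernel; the names `x_train`, `y_train`, `x_test`, and `sigma2` are placeholders.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Gaussian (RBF) kernel matrix between 1-d arrays a and b.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(x_train, y_train, x_test, sigma2=0.01):
    K = rbf_kernel(x_train, x_train)          # K_{ij} = k(x_i, x_j)
    k_star = rbf_kernel(x_train, x_test)      # columns are k(x) at test points
    A = np.linalg.solve(K + sigma2 * np.eye(len(x_train)), k_star)
    mean = A.T @ y_train                      # E[f(x) | Y]
    cov = rbf_kernel(x_test, x_test) - k_star.T @ A   # Cov{f(x), f(x') | Y}
    return mean, cov
```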

Setting of KRR. Given $n$ observations $\lbrace(x_i,y_i)\rbrace_{i=1}^n\subset\mathcal{X}\times\mathbb{R},$ suppose that $f$ is realizable in an RKHS $\mathcal{H}$ with reproducing kernel $k(\cdot,\cdot).$ Take the squared error loss and consider the following kernel ridge regression problem:

$$\min_{f\in\mathcal{H}} \sum_{i=1}^n \left\lbrace y_i - f(x_i)\right\rbrace^2 + \lambda\Vert f\Vert_\mathcal{H}^2,$$

where $\lambda>0$ is a hyperparameter. According to the representer theorem, the solution has a finite dimensional representation

\[f(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot,x_i),\]

Substituting this representation, and using $\Vert f\Vert_\mathcal{H}^2=\pmb{\alpha}^\top\mathbf{K}\pmb{\alpha}$ (which follows from the reproducing property), the optimization problem can be reformulated as

\[\min_{\pmb{\alpha}\in\mathbb{R}^n}\ (\mathbf{Y} - \mathbf{K}\pmb{\alpha})^\top(\mathbf{Y} - \mathbf{K}\pmb{\alpha}) + \lambda\pmb{\alpha}^\top\mathbf{K}\pmb{\alpha},\]

where $\pmb{\alpha}=(\alpha_1,\cdots,\alpha_n)^\top,\ \mathbf{K}=\lbrace k(x_i,x_j)\rbrace_{i,j=1}^n,\ \mathbf{Y}=(y_1,\cdots,y_n)^\top.$ Setting the gradient with respect to $\pmb{\alpha}$ to zero gives $\mathbf{K}\left\lbrace(\mathbf{K}+\lambda\mathbf{I}_n)\pmb{\alpha}-\mathbf{Y}\right\rbrace=\mathbf{0};$ when $\mathbf{K}$ is nonsingular, the minimizer is unique:

\[\hat{\pmb{\alpha}}=(\mathbf{K}+\lambda\mathbf{I}_n)^{-1}\mathbf{Y},\]

and

\[\hat{f}(x) = \sum_{i=1}^n\hat{\alpha}_ik(x,x_i) = \mathbf{Y}^\top(\mathbf{K}+\lambda\mathbf{I}_n)^{-1}\mathbf{k}(x),\]

where $\mathbf{k}(x)=\left(k(x,x_1),\cdots,k(x,x_n)\right)^\top.$ As we can see, the KRR point estimate $\hat{f}$ coincides with the posterior mean of $f$ in kriging once we identify $\lambda$ with $\sigma^2.$
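
A quick numerical sanity check of this equivalence (a sketch with an assumed RBF kernel and toy data): with $\lambda=\sigma^2,$ the KRR fit and the kriging posterior mean agree to machine precision.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Gaussian (RBF) kernel matrix between 1-d arrays a and b.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

# Toy data (placeholders).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
x_new = np.linspace(0.0, 1.0, 50)

sigma2 = lam = 0.1
K = rbf_kernel(x, x)
k_new = rbf_kernel(x, x_new)          # columns are k(x) at new locations

# KRR: alpha_hat = (K + lambda I)^{-1} Y, then f_hat(x) = k(x)^T alpha_hat
alpha_hat = np.linalg.solve(K + lam * np.eye(len(x)), y)
krr_fit = k_new.T @ alpha_hat

# Kriging: posterior mean k(x)^T (K + sigma^2 I)^{-1} Y
kriging_mean = k_new.T @ np.linalg.solve(K + sigma2 * np.eye(len(x)), y)

print(np.allclose(krr_fit, kriging_mean))  # True
```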

