Inner Product and Hilbert Space

Definition (Inner product). Let $\mathcal{H}$ be a vector space over $\mathbb{R}$. A function $\langle\cdot,\cdot\rangle_\mathcal{H}:\mathcal{H}\times\mathcal{H}\to\mathbb{R}$ is said to be an inner product on $\mathcal{H}$ if it satisfies the following three properties:

  • (Symmetry) $\langle f,g\rangle_\mathcal{H} = \langle g, f\rangle_\mathcal{H},\ f,g\in\mathcal{H};$
  • (Linearity) $\langle \alpha f_1 + \beta f_2,g\rangle_\mathcal{H} = \alpha\langle f_1,g\rangle_\mathcal{H} + \beta\langle f_2,g\rangle_\mathcal{H},\ \alpha,\beta\in\mathbb{R},\ f_1,f_2,g\in\mathcal{H};$
  • (Positive definiteness) $\langle f,f\rangle_\mathcal{H} \geq 0\ \forall f\in\mathcal{H}$, and $\langle f,f\rangle_\mathcal{H}=0$ if and only if $f=0$.

A norm is induced by the inner product: $\Vert f\Vert_\mathcal{H} = \sqrt{\langle f,f\rangle_\mathcal{H}}$.
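
For example, the Euclidean space $\mathbb{R}^d$ with the dot product and the space $L^2(\mu)$ of square-integrable functions are inner product spaces:

$$\langle \mathbf{x},\mathbf{y}\rangle = \sum_{i=1}^d x_iy_i,\ \ \mathbf{x},\mathbf{y}\in\mathbb{R}^d;\qquad \langle f,g\rangle_{L^2(\mu)} = \int fg\,\mathrm{d}\mu,\ \ f,g\in L^2(\mu).$$

The induced norms are the Euclidean norm and the $L^2$ norm, respectively.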

Remark.

  • A vector space equipped with an inner product is called an inner product space or a pre-Hilbert space.
  • For a vector space over $\mathbb{C}$, the first property is modified to conjugate symmetry: $\langle f,g\rangle_\mathcal{H} = \overline{\langle g, f\rangle_\mathcal{H}},\ f,g\in\mathcal{H}.$

Definition (Hilbert space). A Hilbert space is a complete inner product space. In particular, every Hilbert space is a Banach space with respect to the norm induced by its inner product.

Remark. Recall the definitions of a Cauchy sequence and a Banach space:

  • A sequence $\lbrace f_n\rbrace_{n\in\mathbb{N}^*}\subset\mathcal{H}$ is called a Cauchy sequence if $\forall \varepsilon > 0,\ \exists N\in\mathbb{N}^{*}$ such that for any $n > m > N,\ \Vert f_n - f_m\Vert_\mathcal{H}\leq\varepsilon$.
  • A normed vector space $(\mathcal{H},\Vert\cdot\Vert_\mathcal{H})$ is called a Banach space if any Cauchy sequence $\lbrace f_n\rbrace_{n\in\mathbb{N}^*}\subset\mathcal{H}$ converges to some $f_\infty\in\mathcal{H}$, i.e. $\lim_{n\to\infty}\Vert f_n - f_\infty\Vert_\mathcal{H}=0.$

Then, the completeness of a Hilbert space $\mathcal{H}$ states that every Cauchy sequence in $\mathcal{H}$ converges with respect to $\Vert\cdot\Vert_\mathcal{H}$ to an element in $\mathcal{H}$.

Kernel Functions

Definition (Kernel). Let $\mathcal{X}$ be a non-empty set. A function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is called a kernel on $\mathcal{X}$ if there exists an $\mathbb{R}$-Hilbert space $\mathcal{H}$ and a map $\phi:\mathcal{X}\to\mathcal{H}$ such that $\forall x,y\in\mathcal{X},$

$$k(x,y) = \langle\phi(x),\phi(y)\rangle_\mathcal{H}.$$

It is evident that $k(\cdot,\cdot)$ is symmetric, i.e. $k(x,y)=k(y,x)\ \forall x,y\in\mathcal{X}$.

Remark. In machine learning, $\phi$ is called a feature map, and $\mathcal{H}$ is called a feature space of $k$.
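
As a concrete illustration, consider the homogeneous polynomial kernel $k(\mathbf{x},\mathbf{y}) = (\mathbf{x}^\top\mathbf{y})^2$ on $\mathbb{R}^2$, which admits the explicit feature map $\phi(\mathbf{x}) = (x_1^2,\ x_2^2,\ \sqrt{2}x_1x_2)\in\mathbb{R}^3$. The minimal NumPy sketch below (the kernel and feature map are chosen only for illustration) checks that the kernel value equals the inner product of the features:

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, y) = (x^T y)^2 on R^2:
    phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2) in R^3."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, y):
    """Homogeneous polynomial kernel of degree 2."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([-0.5, 3.0])

# The kernel value equals the inner product of the features in the feature space R^3.
print(k(x, y))                  # 30.25
print(np.dot(phi(x), phi(y)))   # 30.25
```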

Property. All kernel functions are positive definite. More specifically, let $\mathcal{X}$ be a non-empty set and $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ a kernel on it, then $\forall n\in\mathbb{N}^{*},\ \alpha_1,\cdots,\alpha_n\in\mathbb{R},\ x_1,\cdots,x_n\in\mathcal{X},$

$$\sum_{i=1}^n\sum_{j=1}^n \alpha_i k(x_i,x_j)\alpha_j\geq 0.$$

Proof. Fix $\alpha_{1:n}$ and $x_{1:n}$. Since $k(\cdot,\cdot)$ is a kernel on $\mathcal{X}$, there exist an $\mathbb{R}$-Hilbert space $\mathcal{H}$ and a map $\phi:\mathcal{X}\to\mathcal{H}$ such that $k(x,y)=\langle\phi(x),\phi(y)\rangle_\mathcal{H}$ for all $x,y\in\mathcal{X}$. Therefore,

$$\begin{aligned}\sum_{i=1}^n\sum_{j=1}^n \alpha_i k(x_i,x_j)\alpha_j &= \sum_{i=1}^n\sum_{j=1}^n \langle\alpha_i\phi(x_i), \alpha_j\phi(x_j)\rangle_\mathcal{H} \\ &= \left\langle\sum_{i=1}^n\alpha_i\phi(x_i), \sum_{j=1}^n\alpha_j\phi(x_j)\right\rangle_\mathcal{H} = \left\Vert\sum_{i=1}^n\alpha_i\phi(x_i)\right\Vert_\mathcal{H}^2\geq 0.\end{aligned}$$
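
This inequality can also be checked numerically. The sketch below (a minimal NumPy example; the linear kernel $k(\mathbf{x},\mathbf{y})=\mathbf{x}^\top\mathbf{y}$ is used because its feature map is the identity) builds a Gram matrix and verifies that the quadratic form and the smallest eigenvalue are non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random points x_1, ..., x_n in R^3 and coefficients alpha_1, ..., alpha_n.
X = rng.normal(size=(50, 3))
alpha = rng.normal(size=50)

# Gram matrix of the linear kernel k(x, y) = <x, y>: K_ij = x_i^T x_j.
K = X @ X.T

# The quadratic form alpha^T K alpha is non-negative, and so are all eigenvalues of K
# (up to floating-point error), as the positive-definiteness property asserts.
print(alpha @ K @ alpha >= -1e-10)
print(np.linalg.eigvalsh(K).min() >= -1e-10)
```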

Theorem (Symmetric, positive definite functions are kernels). A function $k: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a kernel if and only if it is symmetric and positive definite.

Proof. In view of the discussion above, it suffices to show that a symmetric and positive definite function $k(\cdot,\cdot)$ is a kernel. Let

$$\mathcal{H}_0 = \left\lbrace\sum_{i=1}^n c_ik(\cdot,x_i):\ n\in\mathbb{N},\ c_1,\cdots,c_n\in\mathbb{R},\ x_1,\cdots,x_n\in\mathcal{X}\right\rbrace,$$

and define $\langle\cdot,\cdot\rangle_{\mathcal{H}_0}: \mathcal{H}_0\times\mathcal{H}_0\to\mathbb{R}$ such that for

$$f:=\sum_{i=1}^m\alpha_ik(\cdot, x_i) \in\mathcal{H}_0,\ g:=\sum_{j=1}^n\beta_jk(\cdot, y_j)\in\mathcal{H}_0,\ \ \alpha_i,\beta_j\in\mathbb{R},\ x_1,\cdots,x_m,y_1,\cdots,y_n\in\mathcal{X},$$

$$\langle f,g\rangle_{\mathcal{H}_0}=\sum_{i=1}^m\sum_{j=1}^n\alpha_i\beta_jk(x_i, y_j).$$

It is straightforward to verify that $\langle\cdot,\cdot\rangle_{\mathcal{H}_0}$ is an inner product on $\mathcal{H}_0$: symmetry and bilinearity are immediate, non-negativity of $\langle f,f\rangle_{\mathcal{H}_0}$ follows from the positive definiteness of $k$, and $\langle f,f\rangle_{\mathcal{H}_0}=0$ implies $f=0$ because the reproducing identity $f(x)=\langle f,k(\cdot,x)\rangle_{\mathcal{H}_0}$ together with the Cauchy–Schwarz inequality gives $\vert f(x)\vert\leq\Vert f\Vert_{\mathcal{H}_0}\sqrt{k(x,x)}$. Now let $\mathcal{H}$ be a completion of $\mathcal{H}_0$ with respect to the norm induced by $\langle\cdot,\cdot\rangle_{\mathcal{H}_0}$; then we have

$$\langle k(\cdot, x), k(\cdot,y)\rangle_{\mathcal{H}} = \langle k(\cdot, x), k(\cdot,y)\rangle_{\mathcal{H}_0} = k(x,y)\ \forall x,y\in\mathcal{X}.$$

Hence $\phi:x\mapsto k(\cdot,x)$ defines a feature map of $k$.
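
To make the construction concrete, here is a small sketch (assuming NumPy; the Gaussian RBF kernel, discussed in the examples below, stands in for a generic symmetric positive definite $k$) that evaluates the inner product $\langle f,g\rangle_{\mathcal{H}_0}$ of two finite kernel expansions and checks the reproducing identity $\langle f,k(\cdot,y)\rangle_{\mathcal{H}_0}=f(y)$:

```python
import numpy as np

def k(x, y, sigma=1.0):
    """A symmetric positive definite function on R (Gaussian RBF, see the examples below)."""
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

# f = sum_i alpha_i k(., x_i) and g = sum_j beta_j k(., y_j) are elements of H_0.
alpha, xs = np.array([1.0, -2.0]), np.array([0.0, 1.0])
beta, ys = np.array([0.5, 1.5, -1.0]), np.array([-1.0, 0.5, 2.0])

def f(t):
    """Evaluate f at the point t."""
    return np.sum(alpha * k(xs, t))

# <f, g>_{H_0} = sum_i sum_j alpha_i beta_j k(x_i, y_j).
inner_fg = alpha @ k(xs[:, None], ys[None, :]) @ beta
print(inner_fg)

# Taking g = k(., y) recovers point evaluation: <f, k(., y)>_{H_0} = sum_i alpha_i k(x_i, y) = f(y).
y = 0.5
print(np.isclose(alpha @ k(xs, y), f(y)))  # True
```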

Properties of Kernels

Some basic rules are listed below.

  • (Linearity) If $k_1,k_2$ are kernels, then $\forall \alpha,\beta \geq 0$, $\alpha k_1 + \beta k_2$ is a kernel.
  • (Limit) For any sequence of kernels $\lbrace k_n\rbrace_{n\in\mathbb{N}^{*}}$, if $\lim_{n\to\infty} k_n(x,y) = k(x,y)$ for every $x,y\in\mathcal{X}$, then $k$ is a kernel (the defining quadratic form is a finite sum, so positive definiteness passes to the pointwise limit).
  • (Product) If $k_1,k_2$ are kernels, then $k_1\cdot k_2$ is a kernel.

    Proof. This is an immediate corollary of Schur’s product theorem. Fix a positive integer $n$ and $x_1,\cdots,x_n\in\mathcal{X}$, and define the matrices

    $$\mathbf{K}_l = \lbrace k_l(x_i,x_j)\rbrace_{i,j=1}^n,\ l=1,2.$$

    It suffices to show that the Hadamard product $\mathbf{K} = \mathbf{K}_1\circ\mathbf{K}_2$ is positive semidefinite. For any $\mathbf{a}\in\mathbb{R}^n$, let $\mathbf{A}$ be the diagonal matrix with $\mathbf{A}_{ii}=\mathbf{a}_i$; then

    $$\begin{aligned}\mathbf{a}^\top\mathbf{Ka} &= \mathrm{trace}\left\lbrace(\mathbf{A}\mathbf{K}_1)^\top\mathbf{K}_2\mathbf{A}\right\rbrace \\ &= \mathrm{trace}\left\lbrace\mathbf{K}_1\mathbf{A}\mathbf{K}_2\mathbf{A}\right\rbrace \\ &= \mathrm{trace}\left\lbrace\mathbf{K}_1^{1/2}\mathbf{A}\mathbf{K}_2^{1/2}\mathbf{K}_2^{1/2}\mathbf{A}\mathbf{K}_1^{1/2}\right\rbrace \\ &= \left\Vert\mathbf{K}_2^{1/2}\mathbf{A}\mathbf{K}_1^{1/2}\right\Vert_\mathrm{F}^2 \geq 0,\end{aligned}$$

    which concludes the proof (the square roots $\mathbf{K}_1^{1/2}$ and $\mathbf{K}_2^{1/2}$ exist because both Gram matrices are positive semidefinite). A numerical check of these closure rules is sketched after this list.

  • (Polynomial kernels) Let $\mathbf{x},\mathbf{y}\in\mathbb{R}^d$ for $d\geq 1$, and let $m$ be a positive integer and $c\geq 0$ a constant. Then

    $$k(\mathbf{x},\mathbf{y}):= (c+\langle \mathbf{x},\mathbf{y}\rangle)^m$$

    is a valid kernel. This is an immediate corollary of the linearity and product rules: the constant $c$ and the linear kernel $\langle \mathbf{x},\mathbf{y}\rangle$ are both kernels, hence so is their sum, and hence so is its $m$-fold product.

  • (Taylor series) Assume the Taylor series

    $$f(z) = \sum_{n=0}^\infty a_nz^n,\ \ \vert z\vert < r, z\in\mathbb{R}$$

    converges for some $r\in(0,\infty]$, with $a_n\geq 0$ for all $n\geq 0$. Define $\mathcal{X}$ to be the open ball of radius $\sqrt{r}$ centered at the origin in $\mathbb{R}^d$, so that $\vert\langle \mathbf{x},\mathbf{y}\rangle\vert\leq\Vert\mathbf{x}\Vert\,\Vert\mathbf{y}\Vert < r$ by the Cauchy–Schwarz inequality. Then for any $\mathbf{x},\mathbf{y}\in\mathcal{X}$,

    $$k(\mathbf{x},\mathbf{y}) = f(\langle \mathbf{x}, \mathbf{y}\rangle) = \sum_{n=0}^\infty a_n\langle \mathbf{x}, \mathbf{y}\rangle^n$$

    defines a valid kernel. This is a corollary of the linearity, limit, and product rules above.

  • (Exponential) The exponential kernel on $\mathbb{R}^d$ is defined as

    $$k(\mathbf{x},\mathbf{y}) := \exp(\langle \mathbf{x}, \mathbf{y}\rangle).$$

    This is the Taylor-series rule applied with $f=\exp$, $a_n = 1/n!$, and $r=\infty$.
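
As a quick numerical illustration of these closure rules, the sketch below (a minimal NumPy example; the point set and parameter values are arbitrary) checks that the Gram matrices produced by a non-negative linear combination, a Hadamard product, a polynomial kernel, and the exponential kernel all have non-negative spectra:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))

# Gram matrix of the linear kernel: K1_ij = <x_i, x_j> (positive semidefinite by construction).
K1 = X @ X.T
# Gram matrix of the exponential kernel exp(<x, y>), obtained elementwise from K1.
K2 = np.exp(K1)

def min_eig(K):
    """Smallest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(K).min()

# Each combination below is again a valid Gram matrix, so its spectrum is
# non-negative up to floating-point error.
print(min_eig(2.0 * K1 + 3.0 * K2) >= -1e-8)   # linearity (alpha, beta >= 0)
print(min_eig(K1 * K2) >= -1e-8)               # product (Schur/Hadamard)
print(min_eig((1.0 + K1) ** 3) >= -1e-8)       # polynomial kernel (c = 1, m = 3)
```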

Examples of Kernels

  • Gaussian RBF (radial basis function) kernels. Given $\sigma > 0$, for $\mathbf{x},\mathbf{y}\in\mathbb{R}^d$, the Gaussian RBF kernel is defined as

    $$k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{1}{2\sigma^2}\Vert \mathbf{x} - \mathbf{y}\Vert^2\right).$$

    Proof. Let $Z$ be a $d$-dimensional Gaussian vector such that $Z\sim N(0,\sigma^{-2}I_d)$. Then the characteristic function of $Z$ can be calculated:

    $$\varphi_Z(\lambda) = \mathbb{E}\left[\exp\left(\mathrm{i}\lambda^\top Z\right)\right] = \exp\left(-\frac{\lambda^\top\lambda}{2\sigma^2}\right),\ \ \mathrm{i}=\sqrt{-1}.$$

    Fix $n\in\mathbb{N}^{*}$, $\alpha_1,\cdots,\alpha_n\in\mathbb{R}$, and $\mathbf{x}_1,\cdots,\mathbf{x}_n\in\mathbb{R}^d$. Then

    $$\begin{aligned}\sum_{s=1}^n\sum_{t=1}^n \alpha_s \exp\left(-\frac{\Vert \mathbf{x}_s - \mathbf{x}_t\Vert^2}{2\sigma^2}\right)\alpha_t &= \sum_{s=1}^n\sum_{t=1}^n \alpha_s\mathbb{E}\left[\exp(\mathrm{i}\mathbf{x}_s^\top Z - \mathrm{i}\mathbf{x}_t^\top Z)\right]\alpha_t \\ &= \mathbb{E}\left[\sum_{s=1}^n\sum_{t=1}^n\alpha_s\exp(\mathrm{i}\mathbf{x}_s^\top Z) \exp(-\mathrm{i}\mathbf{x}_t^\top Z)\alpha_t\right] \\ &= \mathbb{E}\left[\sum_{s=1}^n\alpha_s\exp(\mathrm{i}\mathbf{x}_s^\top Z) \sum_{t=1}^n\alpha_t\exp(-\mathrm{i}\mathbf{x}_t^\top Z)\right] \\ &= \mathbb{E}\left[\sum_{s=1}^n\alpha_s\exp(\mathrm{i}\mathbf{x}_s^\top Z) \cdot\sum_{t=1}^n\alpha_t\overline{\exp(\mathrm{i}\mathbf{x}_t^\top Z)}\right] \\ &= \mathbb{E}\left[\left\vert\sum_{s=1}^n\alpha_s\exp(\mathrm{i}\mathbf{x}_s^\top Z)\right\vert^2\right] \geq 0.\end{aligned}$$

    Therefore the Gaussian RBF kernel is positive definite (and clearly symmetric), hence a valid kernel. A numerical check of this and the following kernels is sketched after this list.

  • Laplacian kernels. Given $\alpha > 0$, for $\mathbf{x},\mathbf{y}\in\mathbb{R}^d$, the Laplacian kernel is defined as

    $$k(\mathbf{x}, \mathbf{y}) = \exp\left(-\alpha\sum_{i=1}^d \vert \mathbf{x}_i - \mathbf{y}_i\vert\right).$$

    The proof is analogous to that for the Gaussian RBF, with the Gaussian random vector replaced by a vector of $d$ i.i.d. Cauchy random variables (one per coordinate). Note that the characteristic function of the Cauchy distribution with location parameter $0$ and scale parameter $\alpha$ is $\hat{\mu}(\lambda) = \exp\left(-\alpha\vert\lambda\vert\right)$.

  • Matérn kernels. The form of the Matérn class of functions is given by

    $$k(x,y) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\vert x-y\vert}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}\vert x-y\vert}{\ell}\right),$$

    where $\ell > 0$ is a length-scale parameter, $\nu > 0$ is a smoothness parameter, and $K_\nu$ is the modified Bessel function of the second kind.
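
The three kernels above can also be checked numerically. Below is a minimal sketch (assuming NumPy and SciPy; points are drawn in $\mathbb{R}$ so that the Matérn formula applies as written, and the parameter values for $\sigma$, $\alpha$, $\ell$, $\nu$ are arbitrary) that builds the Gram matrices and verifies that their eigenvalues are non-negative:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 1))                   # 30 points on the real line

D = cdist(X, X)                                # pairwise distances |x - y|
sigma, alpha, ell, nu = 1.0, 0.7, 1.0, 1.5

K_rbf = np.exp(-D ** 2 / (2 * sigma ** 2))     # Gaussian RBF
K_lap = np.exp(-alpha * D)                     # Laplacian (d = 1)

# Matern: the closed form is singular at |x - y| = 0, where the kernel equals 1 by continuity.
scaled = np.sqrt(2 * nu) * D / ell
safe = np.where(scaled > 0, scaled, 1.0)       # dummy value on the diagonal
K_mat = np.where(
    scaled > 0,
    (2 ** (1 - nu) / gamma(nu)) * safe ** nu * kv(nu, safe),
    1.0,
)

for name, K in [("RBF", K_rbf), ("Laplacian", K_lap), ("Matern", K_mat)]:
    print(name, np.linalg.eigvalsh(K).min() >= -1e-8)
```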

