Vector-Valued RKHS
Motivation
Review: Real (or Complex) RKHS. Let $\mathcal{X}$ be a non-empty set, and let $\mathcal{H}$ be a Hilbert space of functions $h:\mathcal{X}\to\mathbb{R}$ (or $\mathbb{C}$) equipped with inner product $\langle\cdot,\cdot\rangle_\mathcal{H}.$ Then $\mathcal{H}$ is a reproducing kernel Hilbert space, provided that the evaluation functional $\delta_x:\mathcal{H}\to\mathbb{R}$ (or $\mathbb{C}$), $f\mapsto f(x)$ is bounded for all $x\in\mathcal{X}.$
By the Riesz representation theorem, for every $x\in\mathcal{X},$ the boundedness (equivalently, the continuity) of the linear functional $\delta_x$ implies that there exists a unique $k_x\in\mathcal{H}$ such that $f(x) = \delta_x f = \langle f,k_x\rangle_\mathcal{H}$ for all $f\in\mathcal{H}.$ Then we define $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ (or $\mathbb{C}$), $(x,x^\prime)\mapsto k_{x^\prime}(x).$ We can verify the following two properties of $k$:
- For all $x\in\mathcal{X},$ the function $k(\cdot,x):\mathcal{X}\to\mathbb{R}$ (or $\mathbb{C}$) is in $\mathcal{H}.$ To see this, just note that $k(\cdot,x)=k_x.$
- For all $f\in\mathcal{H}$ and $x\in\mathcal{X},$ $\langle f,k(\cdot,x)\rangle_\mathcal{H} = \langle f,k_x\rangle_\mathcal{H} = \delta_x f = f(x).$ This property is referred to as the reproducing property.
We call the function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ (or $\mathbb{C}$) a kernel on $\mathcal{X}.$ It possesses the following properties (for brevity we only discuss the complex case):
- (Symmetry / Conjugate symmetry). $k(x,x^\prime) = \overline{k(x^\prime,x)}$ for all $x,x^\prime\in\mathcal{X}.$
- (Positive definiteness). For all $n\in\mathbb{N},\ x_1,\cdots,x_n\in\mathcal{X},\ \alpha_1,\cdots,\alpha_n\in\mathbb{C},$ we have
$$\sum_{i=1}^n\sum_{j=1}^n \alpha_i\overline{\alpha_j}k(x_i,x_j) \geq 0.$$
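As a quick sanity check, the following minimal sketch (assuming NumPy, and using the Gaussian kernel as a concrete example) verifies both properties numerically on randomly drawn points.

```python
import numpy as np

# Gaussian (RBF) kernel, a standard example of a positive definite kernel.
def rbf_kernel(x, xp, gamma=1.0):
    return np.exp(-gamma * np.abs(x - xp) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)               # arbitrary points x_1, ..., x_n
K = rbf_kernel(x[:, None], x[None, :])   # Gram matrix K_ij = k(x_i, x_j)

# Symmetry (real case) and positive definiteness: K = K^T and K has no
# negative eigenvalues, i.e. sum_{i,j} a_i a_j k(x_i, x_j) >= 0 for all a.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) >= -1e-12)   # tolerance for round-off
```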
In our discussion, we only consider real- and complex-valued functions; however, an RKHS can be defined for broader classes of functions, such as the vector-valued functions below.
Definition 1 (Vector-valued RKHS, vRKHS). Let $\mathcal{H}$ be a Hilbert space of functions $h:\mathcal{X}\to\mathcal{G}$ equipped with inner product $\langle\cdot,\cdot\rangle_\mathcal{H},$ where $\mathcal{G}$ is also a Hilbert space, equipped with inner product $\langle\cdot,\cdot\rangle_\mathcal{G}.$ Then $\mathcal{H}$ is a vRKHS, provided that the linear functional $\delta_{x,g}:\mathcal{H}\to\mathbb{C},\ f\mapsto\langle g,f(x)\rangle_\mathcal{G}$ is bounded for all $x\in\mathcal{X}$ and $g\in\mathcal{G}.$
By the Riesz representation theorem, for every $x\in\mathcal{X}$ and $g\in\mathcal{G},$ the boundedness of the linear functional $\delta_{x,g}$ implies that there exists a unique $k_{x,g}\in\mathcal{H}$ such that $\langle g,f(x)\rangle_\mathcal{G} = \delta_{x,g} f = \langle f,k_{x,g}\rangle_\mathcal{H}$ for all $f\in\mathcal{H}.$ Moreover, it is easy to see that $k_{x,g}$ is linear with respect to $g.$ Then for each $x\in\mathcal{X},$ we define $k_x:\mathcal{G}\to\mathcal{H},\ g\mapsto k_{x,g}.$
Denote by $\mathcal{L}(\mathcal{G})$ the space of all bounded linear operators from $\mathcal{G}$ to $\mathcal{G}.$ We define the kernel $k:\mathcal{X}\times\mathcal{X}\to\mathcal{L}(\mathcal{G})$ as
$$k(x,x^\prime)g = (k_{x^\prime} g)(x)\in\mathcal{G}.$$
All the aforementioned properties of scalar-valued RKHSs carry over:
- (Compatibility). For all $x\in\mathcal{X},$ $k(\cdot,x)=k_x\in\mathcal{H}.$
- (Reproducing property). For all $f\in\mathcal{H},\ x\in\mathcal{X}$ and $g\in\mathcal{G},$ $\langle f,k(\cdot,x)g\rangle_\mathcal{H} = \langle f,k_xg\rangle_\mathcal{H} = \langle g,f(x)\rangle_\mathcal{G}.$
- (Adjoint symmetry). For all $x,x^\prime\in\mathcal{X},$ $k(x,x^\prime) = k(x^\prime, x)^*,$ where $*$ denotes the Hermitian adjoint. To see this, just note that
$$\langle k(x,x^\prime)g,g^\prime\rangle_\mathcal{G} = \overline{\langle g^\prime,(k_{x^\prime}g)(x)\rangle_\mathcal{G}} = \overline{\langle k_{x^\prime}g,k_{x}g^\prime\rangle_\mathcal{H}} = \langle k_{x}g^\prime,k_{x^\prime}g\rangle_\mathcal{H} = \langle g,k(x^\prime,x)g^\prime\rangle_\mathcal{G}.$$
- (Positive definiteness). For all $n\in\mathbb{N},\ x_1,\cdots,x_n\in\mathcal{X}$ and $g_1,\cdots,g_n\in\mathcal{G},$
$$\sum_{i=1}^n\sum_{j=1}^n \langle g_i,k(x_i,x_j)g_j\rangle_\mathcal{G} = \sum_{i=1}^n\sum_{j=1}^n \langle k_{x_j}g_j,k_{x_i}g_i\rangle_\mathcal{H}\geq 0.$$
As with the Moore-Aronszajn theorem for real and complex-valued RKHSs, we can establish a one-to-one correspondence between positive definite operator-valued kernels and vector-valued RKHSs. A formal statement is given below.
Theorem 1 (Micchelli-Pontil). Let $k:\mathcal{X}\times\mathcal{X}\to\mathcal{L}(\mathcal{G})$ be a positive definite kernel. Then there exists a unique (up to isometry) vRKHS which admits $k$ as its reproducing kernel.
Furthermore, this vRKHS can be constructed as the closure of $\mathrm{span}\lbrace k_x g:x\in\mathcal{X},\ g\in\mathcal{G}\rbrace.$
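For intuition, take $\mathcal{G}=\mathbb{R}^d,$ so that $k(x,x^\prime)$ is a $d\times d$ matrix and the Gram "matrix" of $n$ points is an $nd\times nd$ block matrix. The sketch below uses an illustrative separable construction $k(x,x^\prime)=\Phi(x,x^\prime)A$ with $A$ positive semi-definite (an assumption of ours, not the general case) to check operator-valued positive definiteness numerically, assuming NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3

# Separable operator-valued kernel k(x, x') = phi(x, x') * A on G = R^d,
# where phi is a scalar kernel and A is a fixed PSD d x d matrix.
def phi(x, xp):
    return np.exp(-np.abs(x - xp) ** 2)

x = rng.standard_normal(n)
B = rng.standard_normal((d, d))
A = B @ B.T                          # B B^T is always PSD

Phi = phi(x[:, None], x[None, :])    # n x n scalar Gram matrix
K_block = np.kron(Phi, A)            # (nd) x (nd) block Gram matrix

# sum_{i,j} <g_i, k(x_i, x_j) g_j> = g^T K_block g for the stacked vector
# g = (g_1, ..., g_n), so positive definiteness of k means K_block >= 0.
assert np.all(np.linalg.eigvalsh(K_block) >= -1e-10)
```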
Properties
As in the real and complex-valued cases, several inequalities relate the different norms attached to a vRKHS. Below, $\Vert\cdot\Vert$ denotes the operator norm, both for $k(x,x^\prime)\in\mathcal{L}(\mathcal{G})$ and for $k_x:\mathcal{G}\to\mathcal{H}.$
Proposition. The following statements are true:
- $\Vert k_x\Vert = \Vert k(x,x)\Vert^{1/2}\ \forall x\in\mathcal{X}.$
- $\Vert k(x,x^\prime)\Vert \leq \Vert k(x,x)\Vert^{1/2}\Vert k(x^\prime,x^\prime)\Vert^{1/2}\ \forall x,x^\prime\in\mathcal{X}.$
- $\Vert f(x)\Vert_\mathcal{G} \leq \Vert f\Vert_\mathcal{H}\Vert k(x,x)\Vert^{1/2}\ \forall f\in\mathcal{H},\ x\in\mathcal{X}.$
- Suppose $\sup_{x\in\mathcal{X}} \Vert k(x,x)\Vert \leq B.$ Then $\Vert f(x)\Vert_\mathcal{G}\leq\sqrt{B}\Vert f\Vert_\mathcal{H}$ for all $f\in\mathcal{H}$ and $x\in\mathcal{X}.$
Proof. (i) Fix $x\in\mathcal{X}.$ For any $g\in\mathcal{G},$ $\Vert k_xg\Vert_\mathcal{H}^2=\langle k_xg,k_xg\rangle_\mathcal{H} = \langle g,k(x,x)g\rangle_\mathcal{G},$ so by the Cauchy-Schwarz inequality,
$$\begin{align*}\Vert k_x\Vert = \sup_{g\in\mathcal{G}\backslash\lbrace 0\rbrace}\frac{\Vert k_xg\Vert_\mathcal{H}}{\Vert g\Vert_\mathcal{G}} &= \sup_{g\in\mathcal{G}\backslash\lbrace 0\rbrace}\frac{\sqrt{\langle g,k(x,x)g\rangle_\mathcal{G}}}{\Vert g\Vert_\mathcal{G}}\\ &\leq \sup_{g\in\mathcal{G}\backslash\lbrace 0\rbrace}\sqrt{\frac{\Vert k(x,x)g\Vert_\mathcal{G}}{\Vert g\Vert_\mathcal{G}}} = \Vert k(x,x)\Vert^{1/2}.\end{align*}$$
Meanwhile,
$$\begin{align*}\Vert k(x,x)g\Vert_\mathcal{G}^2 &= \langle k(x,x)g, (k_xg)(x)\rangle_\mathcal{G} = \langle k_xg,k_x k(x,x)g\rangle_\mathcal{H}\\ &\leq\Vert k_xg\Vert_\mathcal{H}\Vert k_xk(x,x)g\Vert_\mathcal{H}\leq\Vert k_x\Vert^2\Vert g\Vert_\mathcal{G}\Vert k(x,x)g\Vert_\mathcal{G}.\end{align*}$$
Then
$$\Vert k(x,x)\Vert = \sup_{g\in\mathcal{G}\backslash\lbrace 0\rbrace} \frac{\Vert k(x,x)g\Vert_\mathcal{G}}{\Vert g\Vert_\mathcal{G}} \leq \Vert k_x\Vert^2,$$
which concludes the proof.
(ii) Note that for all $x,x^\prime\in\mathcal{X}$ and $g\in\mathcal{G},$ by the reproducing property and the Cauchy-Schwarz inequality,
$$\begin{align*}\Vert k(x,x^\prime)g\Vert_\mathcal{G}^2 &= \langle k(x,x^\prime)g, (k_{x^\prime}g)(x)\rangle_\mathcal{G} = \langle k_{x^\prime}g, k_xk(x,x^\prime)g\rangle_\mathcal{H}\\ &\leq \Vert k_{x^\prime}g\Vert_\mathcal{H}\Vert k_xk(x,x^\prime)g\Vert_\mathcal{H} \leq \Vert k_x\Vert\Vert k_{x^\prime}\Vert\Vert g\Vert_\mathcal{G}\Vert k(x,x^\prime)g\Vert_\mathcal{G}.\end{align*}$$
Then
$$\Vert k(x,x^\prime)\Vert = \sup_{g\in\mathcal{G}\backslash\lbrace 0\rbrace} \frac{\Vert k(x,x^\prime)g\Vert_\mathcal{G}}{\Vert g\Vert_\mathcal{G}} \leq \Vert k_x\Vert\Vert k_{x^\prime}\Vert \leq \Vert k(x,x)\Vert^{1/2}\Vert k(x^\prime,x^\prime)\Vert^{1/2}.$$
(iii) For any $f\in\mathcal{H}$ and $x\in\mathcal{X},$ we have
$$\begin{align*}\langle f(x),f(x)\rangle_\mathcal{G} = \langle f,k_xf(x)\rangle_\mathcal{H} \leq \Vert f\Vert_\mathcal{H}\Vert k_xf(x)\Vert_\mathcal{H}\leq \Vert f\Vert_\mathcal{H}\Vert k_x\Vert\Vert f(x)\Vert_\mathcal{G}.\end{align*}$$
Then
$$\Vert f(x)\Vert_\mathcal{G} \leq \Vert f\Vert_\mathcal{H}\Vert k_x\Vert = \Vert f\Vert_\mathcal{H}\Vert k(x,x)\Vert^{1/2}.$$
(iv) follows immediately from (iii).
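As a numerical spot-check, the sketch below verifies inequality (ii) for the separable kernel $k(x,x^\prime)=\Phi(x,x^\prime)A$ from the earlier sketch (again our illustrative assumption, with $\mathcal{G}=\mathbb{R}^d$ and the operator norm computed as the largest singular value).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

# Separable kernel k(x, x') = phi(x, x') * A on G = R^d, as before.
def phi(x, xp):
    return np.exp(-np.abs(x - xp) ** 2)

B = rng.standard_normal((d, d))
A = B @ B.T
op_norm = lambda M: np.linalg.norm(M, ord=2)   # operator (spectral) norm

x, xp = 0.3, -1.2
k_xx, k_xpxp, k_xxp = phi(x, x) * A, phi(xp, xp) * A, phi(x, xp) * A

# (ii): ||k(x, x')|| <= ||k(x, x)||^{1/2} * ||k(x', x')||^{1/2}
assert op_norm(k_xxp) <= np.sqrt(op_norm(k_xx) * op_norm(k_xpxp)) + 1e-12
```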
Vector-valued Regression
Now suppose we are given an observational dataset $\lbrace(x_i,g_i)\rbrace_{i=1}^n\subset\mathcal{X}\times\mathcal{G}.$ It is possible to perform regression in the vRKHS setting. Suppose $\mathcal{H}$ is the vRKHS corresponding to the kernel $k:\mathcal{X}\times\mathcal{X}\to\mathcal{L}(\mathcal{G}).$ Inspired by kernel ridge regression, we propose the following risk functional:
$$\widehat{\mathcal{R}}_\lambda[f] = \sum_{i=1}^n \Vert g_i - f(x_i)\Vert_\mathcal{G}^2 + \lambda\Vert f\Vert_\mathcal{H}^2,\ \lambda > 0.$$
The first term stands for the empirical error, and the second is for regularization.
We have the following representer theorem:
Theorem 2. The risk functional $\widehat{\mathcal{R}}_\lambda$ admits a unique minimizer $f^*$ in $\mathcal{H},$ which has the form
$$f^* = \sum_{j=1}^n k_{x_j}c_j,$$
where the coefficients $\lbrace c_j\rbrace_{j\in[n]}\subset\mathcal{G}$ are the unique solution of the linear system
$$\sum_{j=1}^n \left(k(x_i,x_j) + \delta_{ij}\lambda\right)c_j = g_i,\ i\in[n].$$
Proof. By definition, the fitted value is $f^{*}(x_i) = \sum_{j=1}^n k(x_i,x_j)c_j,$ and the linear system gives $g_i - f^{*}(x_i) = \lambda c_i.$
Let $f$ be any element of $\mathcal{H}$ and set $h=f-f^*.$ We have
$$\begin{align*}\langle g_i-f^*(x_i), h(x_i)\rangle_\mathcal{G} &= \lambda\left\langle c_i, f(x_i) - \sum_{j=1}^n k(x_i,x_j)c_j\right\rangle_\mathcal{G},\\ \langle h,f^*\rangle_\mathcal{H} &= \sum_{j=1}^n\langle f,k_{x_j}c_j\rangle_\mathcal{H} - \sum_{i=1}^n\sum_{j=1}^n\langle k_{x_i}c_i, k_{x_j}c_j\rangle_\mathcal{H}\\ &= \sum_{j=1}^n\langle c_j,f(x_j)\rangle_\mathcal{G} - \sum_{i=1}^n\sum_{j=1}^n\langle c_i,k(x_i,x_j)c_j\rangle_\mathcal{G}.\end{align*}$$
For brevity, we apply a matrix representation. Let $\mathbf{K} = \lbrace k(x_i,x_j)\rbrace_{i,j=1}^n,$ $\mathbf{c} = (c_1,\cdots,c_n)^\top,$ $\mathbf{f} = (f(x_1),\cdots,f(x_n))^\top,$ and for $n$-tuples $\mathbf{u},\mathbf{v}$ with entries in $\mathcal{G},$ write $\mathbf{u}^*\mathbf{v} = \sum_{i=1}^n\langle u_i,v_i\rangle_\mathcal{G}.$ Then
$$\sum_{i=1}^n \langle g_i-f^*(x_i), h(x_i)\rangle_\mathcal{G} = \lambda\langle h,f^*\rangle_\mathcal{H} = \lambda(\mathbf{c}^*\mathbf{f} - \mathbf{c}^*\mathbf{Kc}).$$
Now consider the difference of risk values; by the identity above, the cross terms cancel:
$$\begin{align*}\widehat{\mathcal{R}}_\lambda[f] - \widehat{\mathcal{R}}_\lambda[f^*] &= \sum_{i=1}^n\Vert h(x_i)\Vert_\mathcal{G}^2 -2\sum_{i=1}^n \mathrm{Re}\langle g_i - f^*(x_i), h(x_i)\rangle_\mathcal{G} + \lambda\Vert h\Vert_\mathcal{H}^2 + 2\lambda\mathrm{Re}\langle h,f^*\rangle_\mathcal{H}\\ &= \sum_{i=1}^n\Vert h(x_i)\Vert_\mathcal{G}^2 + \lambda\Vert h\Vert_\mathcal{H}^2\geq 0.\end{align*}$$
Equality holds if and only if $h\equiv 0,$ which establishes both optimality and uniqueness. Thus we conclude the proof.
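To make Theorem 2 concrete, here is a minimal sketch (the function names `fit_vv_krr` and `predict_vv_krr` are our own, not from the references) for the special case $\mathcal{G}=\mathbb{R}^d$ with the separable kernel $k(x,x^\prime)=\Phi(x,x^\prime)\mathcal{I},$ in which the block linear system decouples into $(\mathbf{\Phi}+\lambda\mathbf{I})\mathbf{C}=\mathbf{G}$ with the coefficient vectors $c_i$ as rows of $\mathbf{C}.$

```python
import numpy as np

def fit_vv_krr(x, G, lam, phi):
    """Solve the representer system of Theorem 2 for k(x, x') = phi(x, x') I
    on G = R^d: (Phi + lam I) C = G, where row i of C is the coefficient c_i
    and row i of G is the observation g_i."""
    Phi = phi(x[:, None], x[None, :])                  # n x n Gram matrix
    return np.linalg.solve(Phi + lam * np.eye(len(x)), G)

def predict_vv_krr(x_new, x, C, phi):
    # f*(x) = sum_j k(x, x_j) c_j = sum_j phi(x, x_j) c_j
    return phi(x_new[:, None], x[None, :]) @ C

# Toy usage: fit targets in R^2 from scalar inputs.
rng = np.random.default_rng(0)
phi = lambda a, b: np.exp(-np.abs(a - b) ** 2)
x = rng.standard_normal(20)
G = np.stack([np.sin(x), np.cos(x)], axis=1)           # g_i in R^2
C = fit_vv_krr(x, G, lam=1e-2, phi=phi)
print(predict_vv_krr(np.array([0.0]), x, C, phi))      # ~ [sin 0, cos 0]
```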
Example: Estimation of Conditional Mean Embeddings
Setting. Let $\mathcal{X}$ and $\mathcal{Y}$ be two sets, and let $P_{XY}$ be the joint distribution of random variables $(X,Y)$ taking values in $\mathcal{X}\times\mathcal{Y}.$ We begin with an RKHS $\mathcal{H}(\mathcal{Y})$ with reproducing kernel $k^\mathcal{Y}:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}.$ Then, under suitable regularity conditions, for every $x\in\mathcal{X}$ the conditional expectation mapping $h\mapsto\mathbb{E}[h(Y)\vert X=x]$ can be represented as
$$\mathbb{E}[h(Y)\vert X=x] = \langle h,\mu(x)\rangle_{\mathcal{H}(\mathcal{Y})},\ h\in\mathcal{H}(\mathcal{Y}),$$
where $\mu(x)\in\mathcal{H}(\mathcal{Y})$ is called the conditional mean embedding of $P_{Y\vert X=x}.$ It is in fact the Bochner integral:
$$\mu(x) = \int k^\mathcal{Y}(\cdot,y)\mathrm{d}P_{Y|X}(y|x).$$
A natural optimization problem associated with the estimation of the conditional mean is to find a function $\mu:\mathcal{X}\to\mathcal{H}(\mathcal{Y})$ that minimizes the worst-case mean squared error over the unit ball of $\mathcal{H}(\mathcal{Y})$:
$$\mathcal{E}[\mu] = \sup_{\Vert h\Vert_{\mathcal{H}(\mathcal{Y})}\leq 1}\mathbb{E}\left[\left(\mathbb{E}[h(Y)|X] - \langle h,\mu(X)\rangle_{\mathcal{H}(\mathcal{Y})}\right)^2\right].$$
However, this objective is intractable since the conditional expectation $\mathbb{E}[h(Y)\vert X]$ is unknown. Hence we propose a surrogate risk that bounds it from above, applying Jensen's inequality and then the Cauchy-Schwarz inequality:
$$\begin{align*} \mathcal{E}[\mu] &= \sup_{\Vert h\Vert_{\mathcal{H}(\mathcal{Y})}\leq 1}\mathbb{E}\left[\left(\mathbb{E}[\langle h,k^\mathcal{Y}(\cdot,Y) - \mu(X)\rangle_{\mathcal{H}(\mathcal{Y})}|X] \right)^2\right]\\ &\leq \sup_{\Vert h\Vert_{\mathcal{H}(\mathcal{Y})}\leq 1}\mathbb{E}\left[\langle h,k^\mathcal{Y}(\cdot,Y) - \mu(X)\rangle_{\mathcal{H}(\mathcal{Y})}^2\right]\\ &\leq \sup_{\Vert h\Vert_{\mathcal{H}(\mathcal{Y})}\leq 1}\Vert h\Vert_{\mathcal{H}(\mathcal{Y})}^2\mathbb{E}\left[\Vert k^\mathcal{Y}(\cdot,Y) - \mu(X)\Vert_{\mathcal{H}(\mathcal{Y})}^2\right]\\ &=\mathbb{E}\left[\Vert k^\mathcal{Y}(\cdot,Y) - \mu(X)\Vert_{\mathcal{H}(\mathcal{Y})}^2\right]. \end{align*}$$
Adding a regularization term to the empirical version of the surrogate risk yields our objective function, which has the same form as the vector-valued regression problem above:
$$\widehat{\mathcal{R}}_\lambda[\mu] = \sum_{i=1}^n \Vert k^\mathcal{Y}(\cdot,y_i) - \mu(x_i)\Vert_{\mathcal{H}(\mathcal{Y})}^2 + \lambda\Vert\mu\Vert_{\mathcal{H}(\mathcal{X})}^2,$$
where $\mathcal{H}(\mathcal{X})$ is an RKHS of functions $\xi:\mathcal{X}\to\mathcal{H}(\mathcal{Y})$ with reproducing kernel $k^\mathcal{X}:\mathcal{X}\times\mathcal{X}\to\mathcal{L}(\mathcal{H}(\mathcal{Y}));$ that is, we take $\mathcal{G} = \mathcal{H}(\mathcal{Y})$ in the vRKHS framework.
Now it remains to find a kernel $k^{\mathcal{X}}.$ We start with a real-valued kernel $\Phi$ on $\mathcal{X}$ and its corresponding RKHS $\mathcal{H}_\Phi.$ Then $\mathcal{H}(\mathcal{X})$ is defined as
$$\overline{\mathrm{span}}\left\lbrace\left. h\Phi(\cdot,x): \mathcal{X}\to\mathcal{H}(\mathcal{Y}), x^\prime\mapsto \Phi(x^\prime,x)h\right\vert x\in\mathcal{X},h\in\mathcal{H}(\mathcal{Y})\right\rbrace.$$
For all $g,h\in\mathcal{H}(\mathcal{Y}),x,x^\prime\in\mathcal{X},$ the inner product is defined as
$$\langle g\Phi(\cdot,x),h\Phi(\cdot,x^\prime)\rangle_{\mathcal{H}(\mathcal{X})} = \langle g,h\rangle_{\mathcal{H}(\mathcal{Y})} \Phi(x,x^\prime).$$
Moreover, it can be verified that $\mathcal{H}(\mathcal{X})$ is isomorphic to $\mathcal{H}_\Phi\otimes\mathcal{H}(\mathcal{Y}),$ and
$$k^{\mathcal{X}}(x,x^\prime)h = (k^\mathcal{X}_{x^\prime}h)(x) = \left(h\Phi(\cdot,x^\prime)\right)(x) = \Phi(x,x^\prime)h,$$
where the identity $k^\mathcal{X}_{x^\prime}h = h\Phi(\cdot,x^\prime)$ follows from the reproducing property and the inner product above. Hence $k^\mathcal{X}(x,x^\prime) = \Phi(x,x^\prime)\mathcal{I},$ where $\mathcal{I}:\mathcal{H}(\mathcal{Y})\to\mathcal{H}(\mathcal{Y})$ is the identity map on $\mathcal{H}(\mathcal{Y}).$
Estimation. Based on the dataset $\lbrace(x_i,y_i)\rbrace_{i=1}^n,$ the solution
$$\mu^* = \underset{\mu\in\mathcal{H}(\mathcal{X})}{\mathrm{argmin}}\ \widehat{\mathcal{R}}_\lambda[\mu]$$
is given by the representer theorem:
$$\mu^* = \sum_{i=1}^n k^\mathcal{X}(\cdot, x_i)c_i,$$
where the coefficients are
$$c_i = \sum_{j=1}^n W_{ij}k^\mathcal{Y}(\cdot,y_j),\quad \mathbf{W} = \lbrace W_{ij}\rbrace_{i,j=1}^n = (\mathbf{\Phi} + \lambda\mathbf{I})^{-1},\quad \mathbf{\Phi}=\lbrace\Phi(x_i,x_j)\rbrace_{i,j=1}^n.$$
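In practice one rarely needs $\mu^*$ itself: for any $h\in\mathcal{H}(\mathcal{Y}),$ the estimate of $\mathbb{E}[h(Y)\vert X=x]$ is $\langle h,\mu^*(x)\rangle_{\mathcal{H}(\mathcal{Y})} = \sum_{j=1}^n\beta_j(x)h(y_j)$ with weights $\beta(x) = (\mathbf{\Phi}+\lambda\mathbf{I})^{-1}\left(\Phi(x_1,x),\cdots,\Phi(x_n,x)\right)^\top.$ Below is a minimal sketch of this estimator (the name `cme_weights` is our own; we take $h(y)=y$ purely for illustration, even though this $h$ need not lie in $\mathcal{H}(\mathcal{Y})$).

```python
import numpy as np

def cme_weights(x_train, x_query, lam, kx):
    """Weights beta(x) = (Phi + lam I)^{-1} Phi_x, so that the estimated
    conditional expectation is E[h(Y) | X = x] ~= sum_j beta_j(x) h(y_j)."""
    Phi = kx(x_train[:, None], x_train[None, :])     # n x n Gram matrix
    Phi_x = kx(x_train[:, None], x_query[None, :])   # n x m cross-kernel
    return np.linalg.solve(Phi + lam * np.eye(len(x_train)), Phi_x)

# Toy usage: Y = sin(X) + noise; estimate E[Y | X = 0] via h(y) = y.
rng = np.random.default_rng(0)
kx = lambda a, b: np.exp(-np.abs(a - b) ** 2)
x = rng.uniform(-2, 2, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)
beta = cme_weights(x, np.array([0.0]), lam=1e-2, kx=kx)
print(beta[:, 0] @ y)                                # ~ sin(0) = 0
```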
References
- Micchelli, C. A. and Pontil, M. (2005). On learning vector-valued functions. Neural Computation, 17(1):177–204.
- Grünewälder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., and Pontil, M. (2012). Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 1803–1810.