Global-Local Gaussian Process
Setting. Given a training set $\lbrace \mathbf{X}_ n,\mathbf{Y}_ n\rbrace = \lbrace (\mathbf{x}_ i,y_ i)\rbrace_ {i=1}^n \in (\mathbb{R}^d\times\mathbb{R})^n,$ we estimate the latent function $f(\cdot)$ using a Gaussian process (GP) model, i.e., we assume that $f$ is a realization of a Gaussian process:
$$\begin{align*} &y_i = f(\mathbf{x}_i) + \epsilon_i,\ \ \epsilon_i\overset{\mathrm{i.i.d.}}{\sim} N(0,\nu^2),\\ &f(\cdot)\sim\mathcal{GP}(\mu,\tau^2 R(\cdot,\cdot)), \end{align*}$$
where $\mu\in\mathbb{R},\tau^2 >0, R:\mathbb{R}^d\times\mathbb{R}^d\to[-1,1]$ are the mean, variance and correlation function of the GP, respectively. The joint distribution of $\mathbf{Y}_n$ then follows as
$$\mathbf{Y}_n \sim N(\mu\mathbf{1}_n,\tau^2\mathbf{R}_{nn} + \nu^2\mathbf{I}_n),$$
where $\mathbf{R}_ {nn} := \lbrace R(\mathbf{x}_ i,\mathbf{x}_ j)\rbrace_ {i,j=1}^n$ and $\mathbf{I}_ n$ is the $n\times n$ identity matrix.
Define the nugget $\eta = \nu^2/\tau^2.$ For an arbitrary testing location $\mathbf{x}^{* }\in\mathbb{R}^d,$ define $\mathbf{r}_ n(\mathbf{x}^{* }) = \left(R(\mathbf{x}^{* },\mathbf{x}_1),\cdots,R(\mathbf{x}^{* },\mathbf{x}_n)\right)^\top.$ Then the joint distribution of $(f(\mathbf{x}^{* }),\mathbf{Y}_ n)$ can be derived, which yields the conditional distribution
$$\begin{align*} f(\mathbf{x}^*) &\vert \mathbf{Y}_n \sim N\left(\mu(\mathbf{x}^*),\sigma^2(\mathbf{x}^*)\right),\\ \mu(\mathbf{x}^*) &= \mu + \mathbf{r}_n(\mathbf{x}^*)^\top(\mathbf{R}_{nn} + \eta\mathbf{I}_n)^{-1}(\mathbf{Y}_n - \mu\mathbf{1}_n),\\ \sigma^2(\mathbf{x}^*) &= \tau^2\left\lbrace 1 - \mathbf{r}_n(\mathbf{x}^*)^\top(\mathbf{R}_{nn} + \eta\mathbf{I}_n)^{-1}\mathbf{r}_n(\mathbf{x}^*)\right\rbrace. \end{align*}$$
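For concreteness, here is a minimal NumPy sketch of these predictive equations; the function name and the convention of passing a precomputed correlation matrix $\mathbf{R}_ {nn}$ and correlation vector $\mathbf{r}_ n(\mathbf{x}^{* })$ are ours, not from the paper.

```python
import numpy as np

def gp_predict(y, R_nn, r_star, mu, tau2, eta):
    """Conditional mean and variance of f(x*) given Y_n (illustrative sketch).

    y      : (n,) observed responses Y_n
    R_nn   : (n, n) correlation matrix {R(x_i, x_j)}
    r_star : (n,) correlation vector r_n(x*)
    mu, tau2, eta : GP mean, variance, and nugget nu^2 / tau^2
    """
    n = y.shape[0]
    A = R_nn + eta * np.eye(n)          # R_nn + eta * I_n
    # Solve A w = r_star and A z = (y - mu 1_n) instead of forming A^{-1} explicitly.
    w = np.linalg.solve(A, r_star)
    z = np.linalg.solve(A, y - mu)
    mean = mu + r_star @ z
    var = tau2 * (1.0 - r_star @ w)
    return mean, var
```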
Suppose the correlation function is specified by parameters $\boldsymbol{\theta}\in\Theta.$ The maximum likelihood estimates (MLEs) of $\mu,\tau^2,\eta$ and $\boldsymbol{\theta}$ are given as follows:
$$\begin{align*} \hat{\mu} &= \frac{\mathbf{1}_n^\top(\mathbf{R}_{nn}+\eta\mathbf{I}_n)^{-1}\mathbf{Y}_n}{\mathbf{1}_n^\top(\mathbf{R}_{nn}+\eta\mathbf{I}_n)^{-1}\mathbf{1}_n},\\ \hat{\tau}^2 &= \frac{1}{n}(\mathbf{Y}_n - \hat{\mu}\mathbf{1}_n)^\top(\mathbf{R}_{nn}+\eta\mathbf{I}_n)^{-1}(\mathbf{Y}_n - \hat{\mu}\mathbf{1}_n),\\ \hat{\boldsymbol{\theta}},\hat{\eta} &= \underset{\boldsymbol{\theta}\in\Theta,\eta > 0}{\mathrm{argmin}}\ n\log\hat{\tau}^2 + \log\det(\mathbf{R}_{nn}+\eta\mathbf{I}_n). \end{align*}$$
However, the computation of $(\mathbf{R}_ {nn}+\eta\mathbf{I}_ n)^{-1}$ and $\det(\mathbf{R}_ {nn}+\eta\mathbf{I}_ n)$ suffers from a complexity of $O(n^3).$
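As a rough illustration of the objective above, the following sketch evaluates the profile log-likelihood with NumPy; the helper name `neg_profile_loglik` and the generic `corr_fn` argument are our own conventions (at full data size each evaluation is exactly the $O(n^3)$ bottleneck noted above).

```python
import numpy as np

def neg_profile_loglik(theta_eta, X, y, corr_fn):
    """Objective n*log(tau_hat^2) + log det(R_nn + eta*I_n), up to constants.

    theta_eta : concatenation of correlation parameters theta and the nugget eta
    corr_fn   : callable (X, theta) -> (n, n) correlation matrix (assumed interface)
    """
    theta, eta = theta_eta[:-1], theta_eta[-1]
    n = y.shape[0]
    A = corr_fn(X, theta) + eta * np.eye(n)
    L = np.linalg.cholesky(A)                      # Cholesky factor gives the log-determinant
    ones = np.ones(n)
    Ainv_y = np.linalg.solve(A, y)
    Ainv_1 = np.linalg.solve(A, ones)
    mu_hat = (ones @ Ainv_y) / (ones @ Ainv_1)     # profile estimate of mu
    resid = y - mu_hat
    tau2_hat = resid @ np.linalg.solve(A, resid) / n
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return n * np.log(tau2_hat) + logdet
```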
Global-Local Gaussian Process
To circumvent the $O(n^3)$ computational complexity, we can build the GP using $m \ll n$ points from the training data, which is a combination of $g$ global points $\mathbf{X}_ g$ and $l$ local points $\mathbf{X}_ l$ such that $m = g + l.$ Given a testing point $\mathbf{x}^{* },$ we denote $\mathbf{X}_ m(\mathbf{x}^{* }) = \mathbf{X}_ g \cup \mathbf{X}_ l(\mathbf{x}^{* }),$ in which $\mathbf{X}_ g\subset \mathbf{X}_ n$ are predetermined and $\mathbf{X}_ l(\mathbf{x}^{* })$ are obtained as the $l$ nearest neighbors to $\mathbf{x}^{* }$ from $\mathbf{X}_ n\backslash \mathbf{X}_ g.$
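A minimal sketch of this point selection, assuming the indices of the predetermined global points are already available and using a KD-tree from SciPy for the nearest-neighbor query; the function and variable names are ours:

```python
import numpy as np
from scipy.spatial import cKDTree

def select_points(X, global_idx, x_star, l):
    """Indices of X_m(x*) = X_g union X_l(x*), where X_l(x*) is the set of
    l nearest neighbors of x* taken from X_n excluding X_g."""
    global_idx = np.asarray(global_idx)
    mask = np.ones(len(X), dtype=bool)
    mask[global_idx] = False
    candidates = np.flatnonzero(mask)              # indices of X_n excluding X_g
    tree = cKDTree(X[candidates])                  # in practice, build once and reuse
    _, nn = tree.query(x_star, k=l)
    local_idx = candidates[np.atleast_1d(nn)]
    return np.concatenate([global_idx, local_idx]), local_idx
```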
Correlation
The correlation function $R(\cdot,\cdot)$ is modeled as a convex combination of two kernel functions:
$$R(\mathbf{x},\mathbf{x}') = (1-\lambda) G(\mathbf{x},\mathbf{x}') + \lambda L(\mathbf{x},\mathbf{x}'),\ \ 0 < \lambda < 1. \tag{1}$$
$G(\cdot,\cdot)$ and $L(\cdot,\cdot)$ are designed to capture the global trend and the local trend around $\mathbf{x}^{* },$ respectively, and $\lambda$ is a hyperparameter controlling the relative weight of the two trends.
Let $\mathbf{G}_ {mm},\mathbf{L}_ {mm}$ and $\mathbf{R}_ {mm}$ be the kernel matrices on $\mathbf{X}_ m$ with respect to $G(\cdot,\cdot),L(\cdot,\cdot)$ and $R(\cdot,\cdot).$ Then instead of inverting $\mathbf{R}_ {nn} + \eta\mathbf{I}_ n,$ we only need to invert the $m\times m$ matrix $\mathbf{R}_ {mm} + \eta\mathbf{I}_ m,$ where
$$\mathbf{R}_{mm} = (1-\lambda)\mathbf{G}_{mm} + \lambda\mathbf{L}_{mm}.$$
Global trend
The goal in choosing $\mathbf{X}_ g$ is to adequately capture the global trend in the training data. Mak and Joseph (2018) proposed a nonparametric, model-free data reduction technique known as support points, which produces a small set of points that represents the full dataset. Support points are obtained by minimizing the energy distance (Székely and Rizzo, 2013) between their empirical distribution and that of the dataset using difference-of-convex programming.
The global trend is captured by the power exponential kernel:
$$G(\mathbf{x},\mathbf{x}';\boldsymbol{\theta}_g) = \exp\left(-\sum_{i=1}^d \frac{\vert \mathbf{x}(i) - \mathbf{x}'(i)\vert^\alpha}{\theta_g^{(i)}}\right),\ \alpha\in [1,2],\theta_g^{(i)} > 0. \tag{2}$$
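A small sketch of the kernel in $(2)$, vectorized over two sets of inputs (the vectorization and naming are ours):

```python
import numpy as np

def power_exp_kernel(X1, X2, theta_g, alpha=2.0):
    """G(x, x') = exp(-sum_i |x(i) - x'(i)|^alpha / theta_g[i]), as in (2).

    X1 : (n1, d), X2 : (n2, d), theta_g : (d,) positive length-scales.
    """
    diff = np.abs(X1[:, None, :] - X2[None, :, :])      # (n1, n2, d) pairwise differences
    return np.exp(-np.sum(diff ** alpha / theta_g, axis=-1))
```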
The kernel is parametrized by $\boldsymbol{\theta}_g = (\theta_g^{(1)},\cdots,\theta_g^{(d)};\alpha),$ which can be estimated using only $\mathbf{X}_g$ and the corresponding responses $\mathbf{Y}_g:$
$$\begin{align*} \hat{\mu}_g &= \frac{\mathbf{1}_g^\top(\mathbf{G}_{gg}+\eta_g\mathbf{I}_g)^{-1}\mathbf{Y}_g}{\mathbf{1}_g^\top(\mathbf{G}_{gg}+\eta_g\mathbf{I}_g)^{-1}\mathbf{1}_g},\\ \hat{\tau}_g^2 &= \frac{1}{g}(\mathbf{Y}_g - \hat{\mu}_g\mathbf{1}_g)^\top(\mathbf{G}_{gg}+\eta_g\mathbf{I}_g)^{-1}(\mathbf{Y}_g - \hat{\mu}_g\mathbf{1}_g),\\ \hat{\boldsymbol{\theta}}_g,\hat{\eta}_g &= \underset{\boldsymbol{\theta}_g\in\Theta_g,\eta_g > 0}{\mathrm{argmin}}\ g\log\hat{\tau}_g^2 + \log\det(\mathbf{G}_{gg}+\eta_g\mathbf{I}_g). \end{align*} \tag{3}$$
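A hedged sketch of fitting $(3)$ numerically over log-parameters, reusing the `neg_profile_loglik` and `power_exp_kernel` sketches above; the optimizer choice and parametrization are ours, not the authors':

```python
import numpy as np
from scipy.optimize import minimize

def fit_global_params(X_g, y_g, d, alpha=2.0):
    """Estimate theta_g and eta_g on the global points only, as in (3)."""
    def corr_fn(X, theta):
        return power_exp_kernel(X, X, theta, alpha)

    def objective(log_params):
        theta_eta = np.exp(log_params)             # keep theta_g and eta_g positive
        return neg_profile_loglik(theta_eta, X_g, y_g, corr_fn)

    x0 = np.zeros(d + 1)                           # start at theta_g = 1, eta_g = 1
    res = minimize(objective, x0, method="Nelder-Mead")
    return np.exp(res.x[:d]), np.exp(res.x[d])
```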
Local trend
With the global points $\mathbf{X}_ g$ selected, $\mathbf{X}_ l(\mathbf{x}^{* })$ are sequentially obtained as the $l$ nearest neighbors to $\mathbf{x}^{* }$ from $\mathbf{X}_ n\backslash \mathbf{X}_ g.$
To ensure identifiability of the local correlation parameters with respect to the global correlation parameters, we apply Wendland’s compactly supported radial function to capture the local trend:
$$L(\mathbf{x},\mathbf{x}') = \left((q+1)\frac{\Vert \mathbf{x}-\mathbf{x}'\Vert_2}{\theta_l} + 1\right)\max\left\lbrace 0, 1-\frac{\Vert \mathbf{x}-\mathbf{x}'\Vert_2}{\theta_l} \right\rbrace^{q+1},\ \ q=\lfloor \frac{d}{2}\rfloor + 2. \tag{4}$$
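A corresponding sketch of the Wendland kernel in $(4)$:

```python
import numpy as np

def wendland_kernel(X1, X2, theta_l, d):
    """L(x, x') from equation (4), with q = floor(d/2) + 2."""
    q = d // 2 + 2
    r = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1) / theta_l
    return ((q + 1) * r + 1.0) * np.maximum(0.0, 1.0 - r) ** (q + 1)
```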
The parameter $\theta_l$ is set as the covering radius of $\mathbf{X}_ g$ on $\mathbf{X}_ n,$ i.e.,
$$\hat{\theta}_l = \min\left\lbrace\rho:\mathbf{X}_n\subseteq \bigcup_{\mathbf{x}_i\in\mathbf{X}_ g}\mathcal{B}_\rho(\mathbf{x}_i)\right\rbrace, \tag{5}$$
where $\mathcal{B}_ \rho(\mathbf{x})$ is the closed ball of radius $\rho$ centered at $\mathbf{x}.$ Under this setting, $\theta_ l$ does not depend on the testing location. Moreover, for (almost) any testing location $\mathbf{x}^{* }$ there exists at least one global training point $\mathbf{x}_ g\in\mathbf{X}_ g$ for which the local kernel is active, i.e., $L(\mathbf{x}^{* },\mathbf{x}_ g) > 0.$ Therefore, $L$ does not neglect the correlation between $\mathbf{x}^{* }$ and $\mathbf{X}_ g.$
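Since $(5)$ is simply the largest distance from any training point to its nearest global point, it can be computed with a KD-tree; a short sketch (names are ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def covering_radius(X_n, X_g):
    """theta_l = max over x in X_n of the distance to the nearest point of X_g, as in (5)."""
    dists, _ = cKDTree(X_g).query(X_n, k=1)   # nearest global point for every training point
    return float(np.max(dists))
```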
Predictive equations
Assume that $R(\cdot,\cdot)$ is fully specified. Given $\mathbf{Y}_ m = [\mathbf{Y}_ g,\mathbf{Y}_ l],$ the conditional distribution of $f(\mathbf{x}^{* })\vert\mathbf{Y}_ m$ is
$$\begin{align*} f(\mathbf{x}^*) &\vert \mathbf{Y}_m \sim N\left(\mu(\mathbf{x}^*),\sigma^2(\mathbf{x}^*)\right),\\ \mu(\mathbf{x}^*) &= \hat{\mu}_m + \mathbf{r}_m(\mathbf{x}^*)^\top(\mathbf{R}_{mm} + \eta\mathbf{I}_m)^{-1}(\mathbf{Y}_m - \hat{\mu}_m\mathbf{1}_m),\\ \sigma^2(\mathbf{x}^*) &= \hat{\tau}_m^2\left\lbrace 1 - \mathbf{r}_m(\mathbf{x}^*)^\top(\mathbf{R}_{mm} + \eta\mathbf{I}_m)^{-1}\mathbf{r}_m(\mathbf{x}^*)\right\rbrace,\\ \hat{\mu}_m &= \frac{\mathbf{1}_m^\top(\mathbf{R}_{mm}+\eta\mathbf{I}_m)^{-1}\mathbf{Y}_m}{\mathbf{1}_m^\top(\mathbf{R}_{mm}+\eta\mathbf{I}_m)^{-1}\mathbf{1}_m},\\ \hat{\tau}^2_m &= \frac{1}{m}(\mathbf{Y}_m - \hat{\mu}_m\mathbf{1}_m)^\top(\mathbf{R}_{mm}+\eta\mathbf{I}_m)^{-1}(\mathbf{Y}_m - \hat{\mu}_m\mathbf{1}_m). \end{align*} \tag{6}$$
Furthermore, $\mathbf{R}_ {mm}+\eta\mathbf{I}_ m$ can be represented as follows,
$$\begin{align*} \mathbf{R}_{mm}+\eta\mathbf{I}_m &= \begin{bmatrix} \mathbf{R}_{gg}+\eta\mathbf{I}_g & \mathbf{R}_{gl} \\ \mathbf{R}_{lg} & \mathbf{R}_{ll}+\eta\mathbf{I}_l\end{bmatrix}\\ &= \begin{bmatrix}(1-\lambda)\mathbf{G}_{gg} + \lambda\mathbf{L}_{gg} +\eta\mathbf{I}_g & (1-\lambda)\mathbf{G}_{gl} + \lambda\mathbf{L}_{gl} \\ (1-\lambda)\mathbf{G}_{lg} + \lambda\mathbf{L}_{lg} & (1-\lambda)\mathbf{G}_{ll} + \lambda\mathbf{L}_{ll} + \eta\mathbf{I}_l\end{bmatrix}. \end{align*}$$
Its inverse can be more efficiently computed by
$$(\mathbf{R}_{mm}+\eta\mathbf{I}_m)^{-1}=\begin{bmatrix} \boldsymbol{\Sigma}_{gg}^{-1} + \boldsymbol{\Sigma}_{gg}^{-1}\mathbf{R}_{gl}\mathbf{S}^{-1}\mathbf{R}_{lg}\boldsymbol{\Sigma}_{gg}^{-1} &-\boldsymbol{\Sigma}_{gg}^{-1}\mathbf{R}_{gl}\mathbf{S}^{-1}\\ -\mathbf{S}^{-1}\mathbf{R}_{lg}\boldsymbol{\Sigma}_{gg}^{-1} & \mathbf{S}^{-1} \end{bmatrix},$$
where $\boldsymbol{\Sigma}_ {gg} = \mathbf{R}_ {gg} + \eta\mathbf{I}_ g,\ \boldsymbol{\Sigma}_ {ll} = \mathbf{R}_ {ll} + \eta\mathbf{I}_ l,\ \mathbf{S} = \boldsymbol{\Sigma}_ {ll} - \mathbf{R}_ {lg}\boldsymbol{\Sigma}_ {gg}^{-1}\mathbf{R}_ {gl}.$ Since $\boldsymbol{\Sigma}_ {gg}^{-1}$ can be computed using only $\mathbf{X}_ g,$ it remains to invert a smaller $l\times l$ matrix $\mathbf{S}$ per testing location.
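A sketch of solving with this block structure, assuming $\boldsymbol{\Sigma}_ {gg}^{-1}$ has been precomputed once at training time so that only the $l\times l$ Schur complement $\mathbf{S}$ is factorized per testing location (function name and interface are ours):

```python
import numpy as np

def block_solve(Sigma_gg_inv, R_gl, Sigma_ll, b_g, b_l):
    """Solve (R_mm + eta*I_m) x = [b_g; b_l] via the Schur complement S.

    Sigma_gg_inv : (g, g) precomputed inverse of R_gg + eta*I_g
    R_gl         : (g, l) cross-correlation block (R_lg = R_gl.T)
    Sigma_ll     : (l, l) matrix R_ll + eta*I_l
    """
    S = Sigma_ll - R_gl.T @ Sigma_gg_inv @ R_gl          # l x l Schur complement
    t_g = Sigma_gg_inv @ b_g
    x_l = np.linalg.solve(S, b_l - R_gl.T @ t_g)
    x_g = t_g - Sigma_gg_inv @ (R_gl @ x_l)
    return x_g, x_l
```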
Choice of hyperparameters
Two more hyperparameters, $\lambda$ and $\eta_l,$ remain to be specified. Since part of the training data is left unused, we can estimate them in a more robust, validation-based manner: we sample a validation set $\lbrace \mathbf{X}_ v,\mathbf{Y}_ v\rbrace$ from $\lbrace \mathbf{X}_ n,\mathbf{Y}_ n\rbrace \backslash \lbrace \mathbf{X}_ g,\mathbf{Y}_ g\rbrace$ using Twinning.
$\lambda$ and $\eta_l$ are estimated by minimizing the prediction error:
$$\hat{\lambda},\hat{\eta}_l = \underset{\lambda\in[0,1],\eta_l > 0}{\mathrm{argmin}} \sum_{(\mathbf{x},y_\mathbf{x})\in\lbrace\mathbf{X}_v,\mathbf{Y}_v\rbrace}\left(y_\mathbf{x} - \mu(\mathbf{x})\right)^2, \tag{7}$$
where $\mu(\mathbf{x})$ is given in the predictive equations with $\eta = (1-\lambda)\eta_g + \lambda\eta_l.$
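A minimal sketch of $(7)$ as a grid search, assuming a hypothetical `predict_mean(x, lam, eta_l)` that returns $\mu(\mathbf{x})$ from $(6)$ with $\eta = (1-\lambda)\eta_g + \lambda\eta_l$; the grids and function names are ours:

```python
import itertools
import numpy as np

def tune_lambda_eta(X_v, y_v, predict_mean,
                    lambdas=np.linspace(0.05, 0.95, 19),
                    etas=np.logspace(-8, 0, 9)):
    """Pick (lambda, eta_l) minimizing the validation sum of squared errors in (7)."""
    best, best_sse = None, np.inf
    for lam, eta_l in itertools.product(lambdas, etas):
        preds = np.array([predict_mean(x, lam, eta_l) for x in X_v])
        sse = float(np.sum((y_v - preds) ** 2))
        if sse < best_sse:
            best, best_sse = (lam, eta_l), sse
    return best
```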
Algorithm: TwinGP
Based on the previous discussion, the TwinGP procedure is summarized below; an illustrative code sketch of the prediction loop follows the list.
- Input: training set $\lbrace \mathbf{X}_ n,\mathbf{Y}_ n\rbrace$ and testing locations $\mathbf{X}_ t^{* }$
- Identify global training points $\mathbf{X}_ g$
- Estimate $\boldsymbol{\theta}_ g$ as in $(3)$
- Estimate $\theta_l$ as in $(5)$
- Estimate $\lambda$ and $\eta_l$ as in $(7)$
- for $\mathbf{x}^{* }$ in $\mathbf{X}_ t^{* }$ do
  - Identify local training points $\mathbf{X}_ l(\mathbf{x}^{* })$
  - Compute $\mu(\mathbf{x}^{* })$ and $\sigma^2(\mathbf{x}^{* })$ as in $(6)$
- return $\lbrace\mu(\mathbf{x}^{* }),\sigma^2(\mathbf{x}^{* })\rbrace_ {\mathbf{x}^{* }\in\mathbf{X}_ t^{* }}$
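Putting the pieces together, an illustrative sketch of the per-point prediction loop; it reuses the hypothetical helpers sketched earlier and, for clarity, solves the $m\times m$ system directly instead of using the Schur-complement shortcut above:

```python
import numpy as np

def twin_gp_predict(X_n, y_n, X_test, global_idx, params, l=50):
    """Illustrative per-test-point prediction loop; `params` is assumed to hold
    the fitted theta_g, alpha, theta_l, lambda and the combined nugget eta."""
    d = X_n.shape[1]
    out = []
    for x_star in X_test:
        m_idx, _ = select_points(X_n, global_idx, x_star, l)      # X_m(x*)
        X_m, y_m = X_n[m_idx], y_n[m_idx]
        lam = params["lambda"]
        # Combined correlation matrix and vector, as in (1)
        R_mm = ((1 - lam) * power_exp_kernel(X_m, X_m, params["theta_g"], params["alpha"])
                + lam * wendland_kernel(X_m, X_m, params["theta_l"], d))
        r_m = ((1 - lam) * power_exp_kernel(x_star[None, :], X_m, params["theta_g"], params["alpha"])
               + lam * wendland_kernel(x_star[None, :], X_m, params["theta_l"], d))[0]
        # Predictive equations (6)
        A = R_mm + params["eta"] * np.eye(len(m_idx))
        ones = np.ones(len(m_idx))
        mu_m = (ones @ np.linalg.solve(A, y_m)) / (ones @ np.linalg.solve(A, ones))
        resid = y_m - mu_m
        tau2_m = resid @ np.linalg.solve(A, resid) / len(m_idx)
        out.append((mu_m + r_m @ np.linalg.solve(A, resid),
                    tau2_m * (1.0 - r_m @ np.linalg.solve(A, r_m))))
    return out
```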
The computational complexity of TwinGP can be broken down as follows.
Testing
- For any testing location $\mathbf{x}^{* },$ identifying $\mathbf{X}_ l(\mathbf{x}^{* })$ using $kd$-tree on average is $O(\log n).$
- Computing $\mu(\mathbf{x}^{* })$ and $\sigma^2(\mathbf{x}^{* })$ is dominated by the inversion of $\mathbf{R}_ {mm}+ \eta\mathbf{I}_ m.$ The block matrix inversion gives a complexity of $O(l^3 + g^2l)$ per testing location.
Training
- Obtaining a twinning sample to identify $\mathbf{X}_g$ is on average $O(n\log n).$
- The complexity of estimating $\boldsymbol{\theta}_ g$ is dominated by the inversion of $\mathbf{G}_ {gg} + \eta_ g\mathbf{I}_ g,$ which is $O(g^3).$
- $\theta_l$ can be estimated efficiently with a $kd$-tree in $O(n\log g).$
- The complexity of estimating $\lambda$ and $\eta_ l$ is similar to that of testing, i.e., $O(g^2l + l^3)$ per validation point.
Overall Complexity
The overall computational complexity of TwinGP is $O(g^3 + tg^2l + tl^3),$ where $t$ is the number of testing locations. If we select $g$ to be of the order of $\sqrt{n},$ then the training complexity reduces to $O(n^{3/2}).$
References
- Vakayil, A. and Joseph, V. R. (2023). A Global-Local Approximation Framework for Large-Scale Gaussian Process Modeling. arXiv preprint arXiv:2305.10158.