Inverse Problems

Let $y^{* }\in\mathcal{Y}$ denote some noisy, structured observational data, which could be a scalar, a function, or a tensor, and let $H:\Theta\to\mathcal{Y}$ denote a correctly specified model from a parameter space $\Theta$ to the output space $\mathcal{Y}.$ The objective of the inverse problem is then to identify $\theta^{* }\in\Theta$ such that

$$y^* = H(\theta^*) + \epsilon,$$

where $\epsilon$ represents the observational noise. In practice, inverting $H$ directly is often intractable, so we formulate the problem as a least squares problem:

$$\theta^* = \underset{\theta\in\Theta}{\mathrm{argmin}}\Vert H(\theta) - y^*\Vert^2_{\mathcal{Y}},$$

where $\Vert\cdot\Vert_\mathcal{Y}:\mathcal{Y}\to[0,\infty]$ is some norm that serves as a distance measure on the space $\mathcal{Y}.$

This least squares problem can be very challenging, because the model $H$ is usually complicated or a black box, with no computable gradient or Hessian, and the high computational cost of evaluating $H$ also rules out precise numerical approximation. Moreover, multiple local minima could exist due to the noise-contaminated observations and possible information loss in the forward process of evaluating $H.$

Bayesian Optimization

We consider the task of minimizing some black-box function $f:\mathcal{X}\to\mathbb{R}$ which could be expensive to evaluate.

Suppose we have a noise-free dataset $\mathcal{D}_ n = \lbrace (x_i,f(x_i))\rbrace_ {i=1}^n.$ We place a Gaussian process prior on the target $f:$

$$f\sim\mathcal{GP}(\mu,k(\cdot,\cdot)),$$

where $\mu$ is the mean function and $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a preselected kernel function. For an arbitrary location $x\in\mathcal{X},$ define $\mathbf{k}(x) = \left(k(x,x_1),\cdots,k(x,x_n)\right)^\top,$ and denote the covariance matrix on dataset $\mathcal{D}_ n$ by $\mathbf{K}=\lbrace k(x_ i,x_ j)\rbrace_ {i,j=1}^n.$ Then we can derive the conditional distribution of $f(x)$ on $\mathcal{D}_ n:$

$$\begin{align*} f(x)&|\mathcal{D}_n \sim N\left(\tilde{\mu}(x),\tilde{\sigma}^2(x)\right),\\ \tilde{\mu}(x) &= \mu(x) + \mathbf{k}(x)^\top\mathbf{K}^{-1}(f(x_{1:n})-\mu(x_{1:n})),\\ \tilde{\sigma}^2(x) &= k(x,x) - \mathbf{k}(x)^\top\mathbf{K}^{-1}\mathbf{k}(x). \end{align*}$$
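
Below is a minimal NumPy sketch of these posterior formulas, assuming for simplicity a zero prior mean $\mu\equiv 0$ and a squared-exponential kernel; both choices, as well as the one-dimensional toy data, are illustrative rather than prescribed by the text above.

```python
# A sketch of the GP posterior above: zero prior mean and a squared-exponential
# kernel are assumed here purely for illustration.
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0, variance=1.0):
    """k(a, b) = variance * exp(-(a - b)^2 / (2 * lengthscale^2)) for 1-d inputs."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, jitter=1e-8):
    """Posterior mean and variance of f(x_test) given noise-free data (mu = 0)."""
    K = sq_exp_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    k_star = sq_exp_kernel(x_test, x_train)          # rows are k(x)^T for each test point
    mean = k_star @ np.linalg.solve(K, y_train)      # k(x)^T K^{-1} f(x_{1:n})
    var = (sq_exp_kernel(x_test, x_test).diagonal()
           - (k_star @ np.linalg.solve(K, k_star.T)).diagonal())
    return mean, np.maximum(var, 0.0)                # clip tiny negative values

# Condition on three observations and predict on a small grid.
x_train = np.array([0.0, 0.5, 1.0])
y_train = np.sin(2 * np.pi * x_train)
mu_tilde, sigma2_tilde = gp_posterior(x_train, y_train, np.linspace(0.0, 1.0, 5))
```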

Bayesian optimization selects the next sample via an acquisition function $\alpha: \mathcal{X}\to\mathbb{R}$ that assigns utility to any unseen input $x$ based on the inferred predictive distribution from the GP model.

Define the Expected Improvement (EI) acquisition function as follows:

$$\alpha_\mathrm{EI}(x;\mathcal{D}_n) := \mathbb{E}\left[\lbrace f_\min - f(x) \rbrace^+ | \mathcal{D}_n\right].$$

This acquisition function computes how much we can expect to improve over the best value $f_\min = \min_i f(x_i)$ obtained so far. Given the posterior of $f(x),$ a closed form of $\alpha_ \mathrm{EI}$ can be derived:

$$\alpha_\mathrm{EI}(x;\mathcal{D}_n) = \tilde{\sigma}(x)\left\lbrace\left(\frac{f_\min - \tilde{\mu}(x)}{\tilde{\sigma}(x)}\right)\Phi\left(\frac{f_\min - \tilde{\mu}(x)}{\tilde{\sigma}(x)}\right) + \phi\left(\frac{f_\min - \tilde{\mu}(x)}{\tilde{\sigma}(x)}\right)\right\rbrace,$$

where $\Phi(\cdot)$ and $\phi(\cdot)$ are the c.d.f. and p.d.f. of the standard Gaussian distribution $N(0,1),$ respectively.

The next sample is selected by maximizing the EI:

$$x_{n+1} = \underset{x\in\mathcal{X}}{\mathrm{argmax}}\ \alpha_\mathrm{EI}(x;\mathcal{D}_{n}).$$

This approach tends to select a point with large predictive variance $\tilde{\sigma}^2(x)$ (exploration) or small predictive mean $\tilde{\mu}(x)$ (exploitation), and hence naturally balances exploration and exploitation.
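
The following sketch puts the closed-form EI and the selection step together, assuming the posterior mean and standard deviation at a set of candidate points are already available (e.g. from a GP posterior as above); the dense candidate grid and the placeholder posterior values are simple stand-ins for the $\mathrm{argmax}$ over $\mathcal{X}.$

```python
# Closed-form EI for minimization, plus selection of the next point on a grid.
import numpy as np
from scipy.stats import norm

def expected_improvement(mu_tilde, sigma_tilde, f_min):
    """EI at points whose posterior is N(mu_tilde, sigma_tilde^2), for minimization."""
    sigma = np.maximum(sigma_tilde, 1e-12)       # guard against division by zero
    z = (f_min - mu_tilde) / sigma
    return sigma * (z * norm.cdf(z) + norm.pdf(z))

# Placeholder posterior values on a dense candidate grid (stand-in for argmax over X).
candidates = np.linspace(0.0, 1.0, 1001)
mu_tilde = np.cos(3 * candidates)                # posterior mean (placeholder)
sigma_tilde = 0.1 + 0.2 * candidates             # posterior standard deviation (placeholder)
f_min = -0.5                                     # best observed value so far
x_next = candidates[np.argmax(expected_improvement(mu_tilde, sigma_tilde, f_min))]
```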

Case I: Scalar Output

Consider a simulator-based model $H:\Theta\to\mathbb{R}$ with a scalar output and a target observation $y^{* }\in\mathbb{R}.$ The inverse problem can then be formulated as

$$\theta^{* } = \underset{\theta\in\Theta}{\mathrm{argmin}}\left(H(\theta) - y^*\right)^2.$$

The standard Bayesian optimization approach is not directly suited to this least squares problem, for two reasons:

  • The objective function $f(\theta) = \left(H(\theta) - y^*\right)^2$ is nonnegative, while a GP predictive distribution has support on the whole real line; one way to address this is to place a log-Gaussian prior on $f.$
  • Analytical properties of $H$ that might carry helpful information for the optimization are neglected.

Generalized chi-square approach

Instead of modeling $f,$ we place a GP prior on $H,$ the simulator-based function. Then for any input $\theta\in\Theta,$ $H(\theta)\vert\mathcal{D}_ n$ follows a Gaussian distribution, and $f(\theta)\vert\mathcal{D}_ n$ follows a generalized chi-square distribution, whose c.d.f. and p.d.f. are denoted by $G_ {\chi^2}$ and $g_ {\chi^2},$ respectively.

Let $f_\min = \min_i f(\theta_i)$ be the best value from sample $\mathcal{D}_ n = \left\lbrace\left(\theta_ i,H(\theta_ i),f(\theta_ i) = (H(\theta_ i) - y^{* })^2\right)\right\rbrace_ {i=1}^n.$ The EI of any $\theta$ can be derived as

$$\begin{align*} \alpha_\mathrm{EI}(\theta;\mathcal{D}_n) &= \mathbb{E}\left[\lbrace f_\min - f(\theta) \rbrace^+ | \mathcal{D}_n\right]\\ &= \int_0^{f_\min} (f_\min - t)g_{\chi^2}(t)\mathrm{d}t\\ &= f_\min G_{\chi^2}(f_\min) - \int_0^{f_\min}tg_{\chi^2}(t)\mathrm{d}t \\ &= f_\min G_{\chi^2}(f_\min) - \left.tG_{\chi^2}(t)\right|_0^{f_\min} + \int_0^{f_\min}G_{\chi^2}(t)\mathrm{d}t\\ &= \int_0^{f_\min}G_{\chi^2}(t)\mathrm{d}t. \end{align*}$$

It then remains only to evaluate the left-tail probability (i.e. the c.d.f.) of a generalized chi-square distribution.
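
Since $H(\theta)\vert\mathcal{D}_ n$ is Gaussian, $f(\theta) = (H(\theta) - y^{* })^2$ is a scaled noncentral chi-square variable with one degree of freedom, a special case of the generalized chi-square, so its c.d.f. is readily available and the EI reduces to a one-dimensional integral of that c.d.f. Below is a minimal SciPy sketch with placeholder posterior values.

```python
# EI = int_0^{f_min} G_{chi^2}(t) dt for f(theta) = (H(theta) - y*)^2,
# with H(theta) | D_n ~ N(mu, sigma^2) from the GP on H.
import numpy as np
from scipy.stats import ncx2
from scipy.integrate import quad

def ei_squared_error(mu, sigma, y_star, f_min):
    """f(theta) | D_n is sigma^2 times a noncentral chi-square(df=1, nc=((mu-y*)/sigma)^2)."""
    nc = ((mu - y_star) / sigma) ** 2                       # noncentrality parameter
    cdf = lambda t: ncx2.cdf(t / sigma ** 2, df=1, nc=nc)   # G_{chi^2}(t)
    value, _ = quad(cdf, 0.0, f_min)                        # one-dimensional quadrature
    return value

# Placeholder posterior: H(theta) | D_n ~ N(1.2, 0.3^2), target y* = 1.0, best error 0.1.
print(ei_squared_error(mu=1.2, sigma=0.3, y_star=1.0, f_min=0.1))
```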

Case II: Functional Output

Now we consider a functional output $H(\cdot;\theta)$ and a target function $y^{* }(\cdot),$ both defined on a compact set $\mathcal{T}.$ The inverse problem can be formulated as

$$\theta^{* } = \underset{\theta\in\Theta}{\mathrm{argmin}}\int_\mathcal{T}\left(H(t;\theta) - y^*(t)\right)^2\mathrm{d}t.$$

Functional principal component analysis (FPCA)

Given a random function $Y(t)$ supported on a compact set $\mathcal{T}$ with mean function $\mu(t)$ and covariance function $k:\mathcal{T}\times\mathcal{T}\to\mathbb{R},$ under some regularity conditions $k(\cdot,\cdot)$ admits the spectral decomposition

$$k(s,t) = \sum_{l=1}^\infty\lambda_l\phi_l(s)\phi_l(t),$$

where $\lambda_1\geq\lambda_2\geq\cdots\geq 0$ are the eigenvalues with respect to an orthonormal basis $\lbrace \phi_ l\rbrace_ {l=1}^\infty$ of the space of square-integrable functions on $\mathcal{T}.$

For any square-integrable function $y_i(t)$ on $\mathcal{T},$ we have the Karhunen–Loève expansion

$$y_i(t) = \sum_{l=1}^\infty \lambda_l^{1/2}\beta_l^i\phi_l(t),$$

where $\beta_ l^i = \lambda_ l^{-1/2}\int_ \mathcal{T}y_ i(t)\phi_l(t)\mathrm{d}t$ is the $l^\text{th}$ principal component score of $y_i.$ Furthermore, the squared distance between $y_i$ and $y_j$ can be represented as

$$\int_\mathcal{T} \left(y_i(t) - y_j(t)\right)^2\mathrm{d}t = \sum_{l=1}^\infty \lambda_l(\beta_l^i - \beta_l^j)^2.$$

In practice, we retain only the top $L$ principal components, chosen such that $\left(\sum_ {l=1}^L\lambda_ l\right)/\left(\sum_ {l=1}^\infty\lambda_ l\right) \geq 0.99,$ and discard the remainder.
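
A minimal sketch of discretized FPCA is given below, assuming the curves are observed on a common uniform grid over $\mathcal{T}$: the eigendecomposition of the empirical covariance matrix, rescaled by the grid spacing, approximates the spectral decomposition above, and the last line checks the squared-distance identity numerically. The toy curves are purely illustrative.

```python
# Discretized FPCA on curves observed on a uniform grid (toy data, for illustration).
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 20                                    # grid size, number of curves
t = np.linspace(0.0, 1.0, m)
dt = t[1] - t[0]
Y = (rng.normal(size=(n, 1)) * np.sin(2 * np.pi * t)
     + rng.normal(size=(n, 1)) * np.cos(4 * np.pi * t))

C = np.cov(Y, rowvar=False, bias=True)            # empirical covariance k(t_a, t_b)
eigval, eigvec = np.linalg.eigh(C * dt)           # discretized eigenproblem
order = np.argsort(eigval)[::-1]
lam = np.clip(eigval[order], 0.0, None)           # lambda_1 >= lambda_2 >= ... >= 0
phi = eigvec[:, order] / np.sqrt(dt)              # orthonormal in L2: dt * phi.T @ phi = I

L = np.searchsorted(np.cumsum(lam) / lam.sum(), 0.99) + 1    # keep 99% of variation
beta = (Y @ phi[:, :L]) * dt / np.sqrt(lam[:L])              # scores beta_l^i

# The squared L2 distance between two curves is recovered from the retained scores.
i, j = 0, 1
print(np.sum((Y[i] - Y[j]) ** 2) * dt, np.sum(lam[:L] * (beta[i] - beta[j]) ** 2))
```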

Bayesian optimization for functional output

Assume that both $H(\cdot;\theta)$ and $y^{* }(\cdot)$ are square-integrable on $\mathcal{T}.$ Then, retaining the top $L$ principal components, the objective function is approximated by

$$f(\theta) = \sum_{l=1}^L\lambda_l(\beta_l(\theta) - \beta_l^*)^2,$$

where $\beta_ l(\theta)$ and $\beta_ l^{* }$ are the $l^\text{th}$ principal component scores of $H(\cdot;\theta)$ and $y^{* }(\cdot),$ respectively.

The orthogonality of the eigenfunctions implies that the principal component scores are uncorrelated, hence we can fit an independent GP model for each $\beta_l(\theta)$ and derive its posterior, which is a Gaussian distribution. Furthermore, $f(\theta),$ as a weighted sum of squares of Gaussian variables, follows a generalized chi-square distribution. Thus we can derive the EI acquisition as in the scalar-output case and carry out the optimization.
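
A minimal Monte Carlo sketch of this EI is given below, assuming each score $\beta_ l(\theta)\vert\mathcal{D}_ n \sim N(m_ l, s_ l^2)$ independently, as produced by the per-score GPs; the exact generalized chi-square c.d.f. could be integrated instead, as in the scalar case, but sampling keeps the sketch short. All numbers are placeholders.

```python
# Monte Carlo EI for f(theta) = sum_l lambda_l (beta_l(theta) - beta_l^*)^2,
# with beta_l(theta) | D_n ~ N(m_l, s_l^2) independently from the per-score GPs.
import numpy as np

def ei_functional(m, s, beta_star, lam, f_min, n_mc=100_000, seed=0):
    """Monte Carlo estimate of E[(f_min - f(theta))^+ | D_n]."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(loc=m, scale=s, size=(n_mc, len(m)))   # draws of (beta_1, ..., beta_L)
    f = np.sum(lam * (beta - beta_star) ** 2, axis=1)        # draws of f(theta)
    return np.mean(np.maximum(f_min - f, 0.0))

# Placeholder values for L = 3 retained components.
m = np.array([0.4, -0.2, 0.1])          # posterior means of the scores at theta
s = np.array([0.3, 0.2, 0.1])           # posterior standard deviations
beta_star = np.array([0.5, 0.0, 0.0])   # target scores
lam = np.array([2.0, 1.0, 0.5])         # eigenvalues lambda_l
print(ei_functional(m, s, beta_star, lam, f_min=0.4))
```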

The algorithm for Bayesian Optimization for Functional Output (BOFO) is summarized below; a runnable toy sketch of the loop follows the list.

  • Input: dataset $\mathcal{D}_ n = \left\lbrace \left(\theta_i,y_i(\cdot)=H(\cdot;\theta_i),f(\theta_i) = \Vert y_i(\cdot)-y^{* }(\cdot)\Vert_ {L_ 2(\mathcal{T})}^2\right)\right\rbrace_ {i=1}^n.$
  • Estimate the mean function $\mu(\cdot),$ the eigenvalues $\lbrace \lambda_ l\rbrace,$ and the eigenfunctions $\lbrace \phi_ l\rbrace$ via FPCA.
  • Compute the principal component scores $\lbrace\beta_ l^i\rbrace_ {i=1,l=1}^{n,L}$ for the observations and $\lbrace\beta_ l^{* }\rbrace_ {l=1}^L$ for the target.
  • Fit an independent GP model for each component score $\beta_ l(\theta)$ using $\lbrace (\theta_ i, \beta_ l^i)\rbrace_ {i=1}^n.$
  • Compute the predictive generalized chi-square distribution for $f(\theta),$ and find the next sample by solving

    $$\theta_{n+1} = \underset{\theta\in\Theta}{\mathrm{argmax}}\ \alpha_\mathrm{EI}(\theta;\mathcal{D}_n).$$

  • Evaluate $y_ {n+1}(\cdot)=H(\cdot;\theta_ {n+1}),$ and add to the dataset

    $$\mathcal{D}_{n+1}=\mathcal{D}_n \cup \left\lbrace \left(\theta_{n+1},y_{n+1}(\cdot)=H(\cdot;\theta_{n+1}),f(\theta_{n+1}) = \Vert y_{n+1}(\cdot)-y^{* }(\cdot)\Vert_ {L_ 2(\mathcal{T})}^2\right)\right\rbrace$$

  • Set $n\gets n+1$ and repeat until convergence; finally, return the sample with the smallest error.
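
For concreteness, here is a compact, runnable toy version of this loop, using scikit-learn GPs for the per-score models, Monte Carlo EI, and a candidate grid in place of the $\mathrm{argmax};$ the simulator $H,$ the kernel, the grid, and all tuning constants are illustrative assumptions rather than choices made in the reference below.

```python
# Toy BOFO loop (illustrative only): FPCA + one GP per score + Monte Carlo EI.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100); dt = t[1] - t[0]
H = lambda th: np.sin(2 * np.pi * th * t) + th * t       # hypothetical simulator H(t; theta)
y_star = H(0.7)                                          # target curve (so theta* = 0.7)

theta = rng.uniform(0.2, 1.2, size=6)                    # initial design
Y = np.stack([H(th) for th in theta])

for _ in range(10):
    # FPCA on the current curves (same discretization as in the FPCA sketch above).
    val, vec = np.linalg.eigh(np.cov(Y, rowvar=False, bias=True) * dt)
    order = np.argsort(val)[::-1]
    lam = np.clip(val[order], 0.0, None)
    L = np.searchsorted(np.cumsum(lam) / lam.sum(), 0.99) + 1
    phi = vec[:, order[:L]] / np.sqrt(dt)
    beta = (Y @ phi) * dt / np.sqrt(lam[:L])             # scores of the observed curves
    beta_star = (y_star @ phi) * dt / np.sqrt(lam[:L])   # scores of the target

    # Independent GP per score, then Monte Carlo EI on a candidate grid.
    cand = np.linspace(0.2, 1.2, 200)[:, None]
    preds = [GaussianProcessRegressor(kernel=RBF(0.3), alpha=1e-6, normalize_y=True)
             .fit(theta[:, None], beta[:, l]).predict(cand, return_std=True)
             for l in range(L)]
    means = np.array([p[0] for p in preds]).T            # shape (n_cand, L)
    stds = np.array([p[1] for p in preds]).T
    f_min = np.sum(lam[:L] * (beta - beta_star) ** 2, axis=1).min()
    draws = rng.normal(means[None], stds[None], size=(2000, len(cand), L))
    ei = np.maximum(f_min - np.sum(lam[:L] * (draws - beta_star) ** 2, axis=2), 0.0).mean(axis=0)

    theta_next = cand[np.argmax(ei), 0]                  # next sample theta_{n+1}
    theta = np.append(theta, theta_next)
    Y = np.vstack([Y, H(theta_next)])

print(theta[np.argmin(np.sum((Y - y_star) ** 2, axis=1) * dt)])   # best theta found
```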

References

  • Huang, C., Ren, Y., McGuinness, E. K., Losego, M. D., Lively, R. P., & Joseph, V. R. (2021). Bayesian optimization of functional output in inverse problems. Optimization and Engineering, 22, 2553-2574.
