Concentration of Self-Normalized Processes
Setting. We work on a probability space $(\Omega,\mathcal{F},\mathbb{P}).$ Recall that if a stochastic process $\lbrace Y_t\rbrace_ {t=1}^\infty$ is a martingale adapted to the filtration $\lbrace\mathcal{F}_ t\rbrace_ {t=0}^\infty,$ then for any $t,$
$$\begin{align*} &\mathbb{E}\vert Y_t\vert < \infty,\\ &\mathbb{E}[Y_t|\mathcal{F}_{t-1}] = Y_{t-1}\ \mathrm{a.s.} \end{align*}$$
Now, let $\lbrace\mathcal{F}_ t\rbrace_ {t=0}^\infty$ be a filtration and $\lbrace\eta_ t\rbrace$ a real-valued stochastic process such that $\eta_ t$ is $\mathcal{F}_ t$-measurable. Moreover, let $\lbrace X_t\rbrace_ {t=1}^\infty$ be an $\mathbb{R}^d$-valued stochastic process such that $X_ t$ is $\mathcal{F}_ {t-1}$-measurable.
Assumption 1 (Conditional sub-Gaussianity) There exists some $\sigma\geq0$ such that for any $t,$ $\eta_ t$ is conditionally $\sigma$-sub-Gaussian, i.e. $\forall\lambda\in\mathbb{R},$
$$\mathbb{E}[\eta_t|\mathcal{F}_{t-1}] = 0,\qquad \mathbb{E}[\exp(\lambda\eta_t)|\mathcal{F}_{t-1}] \leq \exp\left(\frac{\lambda^2\sigma^2}{2}\right).$$
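For example, Rademacher noise ($\eta_t = \pm 1$ with equal probability, drawn independently of the past) satisfies Assumption 1 with $\sigma = 1,$ since $\mathbb{E}[\exp(\lambda\eta_t)\vert\mathcal{F}_ {t-1}] = \cosh\lambda \leq \exp(\lambda^2/2).$ A minimal numerical sketch of this MGF bound (purely illustrative; the grid of $\lambda$ values is arbitrary):

```python
import numpy as np

# Rademacher noise (eta = +/-1 with equal probability, independent of the past)
# satisfies Assumption 1 with sigma = 1: its MGF is cosh(lambda) <= exp(lambda^2 / 2).
lambdas = np.linspace(-5.0, 5.0, 201)
mgf = np.cosh(lambdas)                 # exact MGF of Rademacher noise
bound = np.exp(lambdas ** 2 / 2.0)     # sub-Gaussian bound with sigma = 1
assert np.all(mgf <= bound + 1e-12)
print("max ratio MGF / bound:", (mgf / bound).max())   # stays <= 1
```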
A supermartingale statement
Lemma 1. For an arbitrary $\lambda\in\mathbb{R}^d$ and $t\geq 0,$ define (with the empty-sum convention $M_ 0^\lambda = 1$)
$$ M_t^\lambda = \exp\left(\sum_{s=1}^t \left[\frac{\eta_s\lambda^\top X_s}{\sigma} - \frac{1}{2}(\lambda^\top X_s)^2\right]\right). $$
Let $\tau$ be a stopping time with respect to the filtration $\lbrace\mathcal{F}_ t\rbrace_ {t=0}^\infty.$ Then $M_ \tau^\lambda$ is almost surely well-defined and
$$\mathbb{E}[M_\tau^\lambda]\leq 1.$$
Proof. We claim that $\lbrace M_t^\lambda\rbrace_ {t=0}^\infty$ is a supermartingale. Let
$$\begin{align*} \Delta_t^\lambda = \exp\left(\frac{\eta_t\lambda^\top X_t}{\sigma} - \frac{1}{2}(\lambda^\top X_t)^2\right). \end{align*}$$
Since $X_ t$ is $\mathcal{F}_ {t-1}$-measurable, applying the conditional sub-Gaussianity of $\eta_ t$ with $\lambda$ replaced by $\lambda^\top X_ t/\sigma$ immediately gives
$$\mathbb{E}[\Delta_t^\lambda|\mathcal{F}_{t-1}]\leq 1.$$
Clearly, $\Delta_ t^\lambda$ is $\mathcal{F}_ t$-measurable, as is $M_ t^\lambda.$ Moreover,
$$\begin{align*} \mathbb{E}[M_ t^\lambda|\mathcal{F}_{t-1}] &= \mathbb{E}[M_{t-1}^\lambda \Delta_ t^\lambda|\mathcal{F}_{t-1}] \\ &= M_{t-1}^\lambda \mathbb{E}[\Delta_ t^\lambda|\mathcal{F}_{t-1}] \leq M_{t-1}^\lambda. \end{align*}$$
Hence $\lbrace M_ t^\lambda\rbrace_ {t=0}^\infty$ is a nonnegative supermartingale with $M_ 0^\lambda = 1,$ so $\mathbb{E}[M_ t^\lambda]\leq 1$ for every $t.$
By the convergence theorem for nonnegative supermartingales, $M_ \infty^\lambda = \lim_ {t\to\infty}M_ t^\lambda$ exists almost surely. Hence $M_ \tau^\lambda$ is almost surely well-defined whether or not $\tau <\infty$ (setting $M_ \tau^\lambda := M_ \infty^\lambda$ on $\lbrace\tau=\infty\rbrace$). Furthermore, let $Q_ t^\lambda := M_ {\tau\wedge t}^\lambda$ be the stopped version of $(M_ t^\lambda)_ t;$ it is again a nonnegative supermartingale, so $\mathbb{E}[Q_ t^\lambda]\leq \mathbb{E}[Q_ 0^\lambda] = 1.$ By Fatou’s Lemma,
$$\begin{align*} \mathbb{E}[M_\tau^\lambda] = \mathbb{E}[\liminf_{t\to\infty} Q_t^\lambda] \leq \liminf_{t\to\infty}\mathbb{E}[Q_t^\lambda] \leq 1, \end{align*}$$
which concludes the proof.
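As a purely numerical illustration of Lemma 1 (not part of the argument), the sketch below simulates the stopped process $M^\lambda_ {\tau\wedge T}$ for standard Gaussian noise ($\sigma = 1$), a deterministic sequence $X_t,$ and the stopping time $\tau = \min\lbrace t : \eta_1+\cdots+\eta_t > c\rbrace$ truncated at a horizon $T$ (itself a stopping time, so the lemma applies). All constants are arbitrary choices; the Monte Carlo average of $M^\lambda_{\tau\wedge T}$ should not exceed $1$ beyond sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)

T, sigma = 50, 1.0                        # horizon and noise scale (arbitrary choices)
lam = np.array([0.2, -0.1])               # a fixed lambda in R^2 (arbitrary)
c, n_runs = 3.0, 20000                    # stopping threshold and number of sample paths

t_grid = np.arange(1, T + 1)
X = np.stack([np.cos(t_grid), np.sin(t_grid)], axis=1)   # deterministic (hence predictable) X_t
lx = X @ lam                                             # lambda^T X_t for each t

eta = rng.standard_normal((n_runs, T))                   # eta_t ~ N(0,1) is 1-sub-Gaussian
log_M = np.cumsum(eta * lx / sigma - 0.5 * lx ** 2, axis=1)

# Stopping time tau = first t with eta_1 + ... + eta_t > c, truncated at T.
hit = np.cumsum(eta, axis=1) > c
stop = np.where(hit.any(axis=1), hit.argmax(axis=1), T - 1)   # 0-based index of tau ^ T

M_tau = np.exp(log_M[np.arange(n_runs), stop])
# With exactly Gaussian noise, M is in fact a martingale and the mean is ~1;
# Lemma 1 only promises <= 1 (up to Monte Carlo error).
print("Monte Carlo estimate of E[M_tau]:", M_tau.mean())
```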
Self-normalized bound for vector-valued martingales
Based on the properties of $\lbrace\eta_ t\rbrace_ {t=1}^\infty$ and $\lbrace X_ t\rbrace_ {t=1}^\infty,$ we can construct a vector-valued martingale
$$S_t=\sum_{s=1}^t\eta_s X_s$$
with respect to the filtration $\lbrace\mathcal{F}_ t\rbrace_ {t=1}^\infty.$
Assume that $V_ 0$ is a $d\times d$ positive definite matrix. For any $t\geq 0,$ let
$$V_t = V_0 + \sum_{s=1}^t X_sX_s^\top.$$
Lemma 2. Let $\tau$ be a stopping time with respect to the filtration $\lbrace\mathcal{F}_ t\rbrace_ {t=0}^\infty.$ Then, for $\delta >0,$ the following inequality holds with probability at least $1-\delta:$
$$ \Vert S_\tau\Vert_{V_\tau^{-1}}^2 \leq 2\sigma^2\log\left(\frac{\det(V_\tau)^{1/2}\det(V_0)^{-1/2}}{\delta}\right). $$
Proof. Let $H_t = V_t-V_0 = \sum_{s=1}^t X_sX_s^\top.$ By definition, $M_ t^\lambda$ can be represented as
$$M_t^\lambda = \exp\left(\frac{\lambda^\top S_t}{\sigma} - \frac{1}{2}\Vert\lambda\Vert_{H_t}^2\right).$$
Let $\Lambda\sim N(0,V_ 0^{-1})$ be a $d$-dimensional Gaussian random vector independent of all other random variables. Define
$$M_t=\mathbb{E}[M_t^\Lambda | \mathcal{F}_\infty],$$
where $\mathcal{F}_ \infty = \sigma\left(\bigcup_ {t=0}^\infty\mathcal{F}_ t\right).$ Since $\Lambda$ is independent of $\mathcal{F}_ \infty$ and $\mathbb{E}[M_ \tau^\lambda]\leq 1$ for every fixed $\lambda$ (Lemma 1), we get $\mathbb{E}[M_ \tau] = \mathbb{E}\left[\mathbb{E}[M_ \tau^\Lambda\vert\Lambda]\right] \leq 1.$
Now we compute $M_t.$ For a $d\times d$ positive definite matrix $P,$ let $c(P) = \int \exp\left(-x^\top Px/2\right)\mathrm{d}x = \sqrt{(2\pi)^d/\det(P)}.$ Writing $\phi(\cdot\,;0,V_ 0^{-1})$ for the density of $N(0,V_ 0^{-1})$ and $H_ t^{\dagger}$ for the Moore–Penrose pseudo-inverse of $H_ t$ (note that $S_ t$ lies in the column space of $H_ t,$ so the completion of squares below is valid), we obtain
$$\begin{align*} M_t &= \int \exp\left(\frac{\lambda^\top S_t}{\sigma} - \frac{1}{2}\Vert\lambda\Vert_{H_t}^2\right)\phi(\lambda;0,V_0^{-1})\mathrm{d}\lambda\\ &= \int \exp\left(-\frac{1}{2}\left\Vert\lambda - \sigma^{-1}H_t^{\dagger}S_t\right\Vert_{H_t}^2 + \frac{1}{2}\Vert \sigma^{-1}S_t\Vert_{H_t^{\dagger}}^2\right)\phi(\lambda;0,V_0^{-1})\mathrm{d}\lambda\\ &= \frac{1}{c(V_0)}\exp\left(\frac{1}{2}\Vert \sigma^{-1}S_t\Vert_{H_t^{\dagger}}^2\right)\int\exp\left(-\frac{1}{2}\left\lbrace\left\Vert\lambda - \sigma^{-1}H_t^{\dagger}S_t\right\Vert_{H_t}^2 + \Vert\lambda\Vert_{V_0}^2\right\rbrace\right)\mathrm{d}\lambda. \end{align*}$$
It can be verified that
$$\begin{align*} &\left\Vert\lambda - \sigma^{-1}H_t^{\dagger}S_t\right\Vert_{H_t}^2 + \Vert\lambda\Vert_{V_0}^2 = \lambda^\top (V_0+H_t)\lambda - 2\sigma^{-1}\lambda^\top S_t + \sigma^{-2}S_t^\top H_t^{\dagger} S_t\\ &= \Vert \lambda - \sigma^{-1}V_t^{-1}S_t\Vert_{V_t}^2 + \Vert \sigma^{-1}S_t\Vert_{H_t^{\dagger}}^2 - \Vert \sigma^{-1}S_t\Vert_{V_t^{-1}}^2, \end{align*}$$
then
$$\begin{align*} M_t &= \frac{1}{c(V_0)}\exp\left(\frac{1}{2}\Vert \sigma^{-1}S_t\Vert_{V_t^{-1}}^2\right)\int\exp\left(-\frac{1}{2}\Vert \lambda - \sigma^{-1}V_t^{-1}S_t\Vert_{V_t}^2\right)\mathrm{d}\lambda\\ &= \frac{c(V_t)}{c(V_0)}\exp\left(\frac{1}{2}\Vert \sigma^{-1}S_t\Vert_{V_t^{-1}}^2\right)\\ &= \sqrt{\frac{\det(V_0)}{\det(V_t)}}\exp\left(\frac{1}{2}\Vert \sigma^{-1}S_t\Vert_{V_t^{-1}}^2\right). \end{align*}$$
Since $\mathbb{E}[M_ \tau] \leq 1,$ Markov’s inequality gives
$$\begin{align*} \delta \geq \delta\,\mathbb{E}[M_\tau] &\geq \mathbb{P}\left\lbrace {M_\tau} > {1/\delta} \right\rbrace\\ &= \mathbb{P}\left\lbrace \sqrt{\frac{\det(V_0)}{\det(V_\tau)}}\exp\left(\frac{1}{2}\Vert \sigma^{-1}S_\tau\Vert_{V_\tau^{-1}}^2\right) > {1/\delta} \right\rbrace\\ &= \mathbb{P}\left\lbrace \Vert S_\tau\Vert_{V_\tau^{-1}}^2 > 2\sigma^2\log\left(\frac{\det(V_\tau)^{1/2}\det(V_0)^{-1/2}}{\delta}\right)\right\rbrace. \end{align*}$$
This concludes the proof.
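As a numerical sanity check of the closed form derived for $M_t$ above (illustrative only; the dimension, horizon, noise scale, and data below are arbitrary), one can fix a single realization of $(X_s,\eta_s)_ {s\leq t},$ average $M_t^\Lambda$ over fresh draws of $\Lambda\sim N(0,V_0^{-1}),$ and compare with $\sqrt{\det(V_0)/\det(V_t)}\exp\left(\frac{1}{2\sigma^2}\Vert S_t\Vert_{V_t^{-1}}^2\right):$

```python
import numpy as np

rng = np.random.default_rng(1)

d, t, sigma = 3, 20, 0.5                      # arbitrary dimension, horizon, noise scale
V0 = np.eye(d)
X = rng.standard_normal((t, d))               # one fixed realization of X_1, ..., X_t
eta = sigma * rng.standard_normal(t)          # one fixed realization of eta_1, ..., eta_t

S = X.T @ eta                                 # S_t = sum_s eta_s X_s
H = X.T @ X                                   # H_t = sum_s X_s X_s^T
V = V0 + H                                    # V_t

closed_form = np.sqrt(np.linalg.det(V0) / np.linalg.det(V)) \
    * np.exp(0.5 * S @ np.linalg.solve(V, S) / sigma ** 2)

# Monte Carlo: average M_t^Lambda over Lambda ~ N(0, V0^{-1}), holding S_t and H_t fixed.
Lam = rng.multivariate_normal(np.zeros(d), np.linalg.inv(V0), size=200_000)
log_M = Lam @ S / sigma - 0.5 * np.einsum('ij,jk,ik->i', Lam, H, Lam)
print("closed form:", closed_form, " Monte Carlo:", np.exp(log_M).mean())
```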
Self-normalized tail inequality
Theorem 1. For $\delta >0,$ with probability at least $1-\delta,$ the following inequality holds simultaneously for all $t\geq 0:$
$$\Vert S_t\Vert_{V_t^{-1}}^2 \leq 2\sigma^2\log\left(\frac{\det(V_t)^{1/2}\det(V_0)^{-1/2}}{\delta}\right).$$
Proof. We use a stopping time construction. For each $t\geq 0,$ define the bad event
$$B_t(\delta) = \left\lbrace\omega\in\Omega: \Vert S_t\Vert_{V_t^{-1}}^2 > 2\sigma^2\log\left(\frac{\det(V_t)^{1/2}\det(V_0)^{-1/2}}{\delta}\right)\right\rbrace.$$
We are interested in bounding the probability of $\bigcup_{t\geq 0}B_t(\delta).$ Define $\tau(\omega)=\min\lbrace t\geq 0:\omega\in B_t(\delta)\rbrace$ with the convention that $\min\emptyset=\infty.$ Since $B_ t(\delta)\in\mathcal{F}_ t,$ $\tau$ is a stopping time. Furthermore,
$$\bigcup_{t\geq 0}B_t(\delta) = \lbrace \omega:\tau(\omega)<\infty\rbrace.$$
Then, by Lemma 2
$$\begin{align*} \mathbb{P}\left\lbrace\bigcup_{t\geq 0}B_t(\delta)\right\rbrace &= \mathbb{P}\lbrace\tau < \infty\rbrace\\ &= \mathbb{P}\left\lbrace\Vert S_\tau\Vert_{V_\tau^{-1}}^2 > 2\sigma^2\log\left(\frac{\det(V_\tau)^{1/2}\det(V_0)^{-1/2}}{\delta}\right),\tau < \infty\right\rbrace\\ &\leq \mathbb{P}\left\lbrace\Vert S_\tau\Vert_{V_\tau^{-1}}^2 > 2\sigma^2\log\left(\frac{\det(V_\tau)^{1/2}\det(V_0)^{-1/2}}{\delta}\right)\right\rbrace\\ &\leq \delta. \end{align*}$$
Thus we conclude the proof.
Remark. The norm of the martingale $S_t$ is weighted by the matrix $V_t^{-1},$ which is itself built from the underlying process. This is the sense in which the bound is “self-normalized.”
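To see Theorem 1 in action, the following sketch (with an arbitrarily chosen dimension, horizon, noise level, and confidence level) simulates independent trajectories and estimates the probability that the inequality is violated at some $t\leq T.$ The empirical failure rate should stay below $\delta$ (in practice it is usually far below, since the bound is conservative):

```python
import numpy as np

rng = np.random.default_rng(2)

d, T, sigma, delta = 3, 100, 1.0, 0.05        # arbitrary dimension, horizon, noise, confidence
n_runs = 1000
V0 = np.eye(d)
logdet_V0 = np.linalg.slogdet(V0)[1]

failures = 0
for _ in range(n_runs):
    S, V = np.zeros(d), V0.copy()
    for t in range(1, T + 1):
        # X_t is drawn independently of all noise terms, so it is F_{t-1}-measurable for
        # F_{t-1} = sigma(X_1,...,X_t, eta_1,...,eta_{t-1}), and eta_t stays conditionally
        # sigma-sub-Gaussian because it is independent of F_{t-1}.
        x_t = rng.standard_normal(d)
        eta_t = sigma * rng.standard_normal()
        S += eta_t * x_t
        V += np.outer(x_t, x_t)
        lhs = S @ np.linalg.solve(V, S)
        rhs = 2 * sigma ** 2 * (0.5 * (np.linalg.slogdet(V)[1] - logdet_V0) + np.log(1 / delta))
        if lhs > rhs:
            failures += 1
            break

print(f"empirical failure rate: {failures / n_runs:.3f} (Theorem 1 guarantees <= {delta})")
```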
Self-normalized Processes in RKHS
The result above extends to the RKHS setting.
Theorem 2. Given $\epsilon > 0$ and $\delta >0,$ with probability at least $1-\delta,$ the following inequality holds simultaneously for all $t\geq 0:$
$$\Vert \eta_{1:t}\Vert_{\left((K_t+\epsilon I)^{-1} + I\right)^{-1}}^2 \leq 2\sigma^2\log\left(\frac{\det\left((1 + \epsilon)I + K_t\right)^{1/2}}{\delta}\right),$$
where $k:\mathbb{R}^d\times\mathbb{R}^d\to\mathbb{R}$ is a symmetric positive definite kernel, $K_t = \lbrace k(x_ i,x_ j)\rbrace_ {i,j=1}^t,$ $\eta_ {1:t} = (\eta_ 1,\cdots,\eta_ t)^\top,$ and each $x_ s\in\mathbb{R}^d$ is $\mathcal{F}_ {s-1}$-measurable.
Proof. Analogous to Lemma 1, for a function $g:\mathbb{R}^d\to\mathbb{R}$ we construct a supermartingale $\lbrace M_t^g\rbrace_ {t=0}^\infty$ with respect to $\lbrace\mathcal{F}_ t\rbrace_ {t=0}^\infty:$
$$ M_t^g = \exp\left(\sum_{s=1}^t\left[\frac{\eta_sg(x_s)}{\sigma} - \frac{1}{2}g(x_s)^2\right]\right).$$
It can be verified that $\lbrace M_t^g\rbrace_ {t=0}^\infty$ is indeed a supermartingale using the sub-Gaussianity of $\eta_t.$ Then for a stopping time $\tau$ with respect to the filtration $\lbrace\mathcal{F}_ t\rbrace_ {t=0}^\infty,$ we have $\mathbb{E}[M_ \tau^g]\leq 1.$
Now, suppose that $h$ is a random function distributed according to a Gaussian process $\mathcal{GP}\left(0,k(\cdot,\cdot) + \epsilon\mathbb{1}(\cdot=\cdot)\right),$ independent of everything else. Letting $\mathbf{h} = (h(x_ 1),\cdots,h(x_ t))^\top,$ we have $\mathbf{h}\sim N(0,K_ t + \epsilon I).$
For each $t\geq 0,$ define
$$M_t = \mathbb{E}[M_t^h|\mathcal{F}_\infty],$$
then $\mathbb{E}[M_ \tau] = \mathbb{E}\left[\mathbb{E}[M_ \tau^h\vert h]\right] \leq 1.$
Furthermore, $M_t$ can be computed:
$$\begin{align*} M_t &= \frac{\exp\left({\Vert\eta_{1:t}\Vert^2}/{2\sigma^2}\right)}{\sqrt{(2\pi)^t\det(K_t + \epsilon I)}}\int\exp\left(-\frac{\left\Vert\mathbf{h} - \sigma^{-1}\eta_{1:t}\right\Vert^2 + \Vert\mathbf{h}\Vert_{(K_t + \epsilon I)^{-1}}^2}{2}\right)\mathrm{d}\mathbf{h}\\ &= \frac{1}{\sqrt{(2\pi)^t\det(K_t + \epsilon I)}}\exp\left(\frac{1}{2\sigma^2}\Vert\eta_{1:t}\Vert^2_{\left(I+(K_t+\epsilon I)^{-1}\right)^{-1}}\right)\\ &\qquad \times \int\exp\left(-\frac{1}{2}\left\Vert\mathbf{h} - \sigma^{-1}\left(I+(K_t+\epsilon I)^{-1}\right)^{-1}\eta_{1:t}\right\Vert^2_{I+(K_t+\epsilon I)^{-1}}\right) \mathrm{d}\mathbf{h}\\ &= \sqrt{\frac{\det(I+(K_t + \epsilon I)^{-1})^{-1}}{\det(K_t + \epsilon I)}}\exp\left(\frac{1}{2\sigma^2}\Vert\eta_{1:t}\Vert^2_{\left(I+(K_t+\epsilon I)^{-1}\right)^{-1}}\right)\\ &= \frac{1}{\sqrt{\det\left((1 + \epsilon)I + K_t\right)}}\exp\left(\frac{1}{2\sigma^2}\Vert\eta_{1:t}\Vert^2_{\left(I+(K_t+\epsilon I)^{-1}\right)^{-1}}\right). \end{align*}$$
Plugging this representation of $M_ \tau$ into Markov’s inequality,
$$\mathbb{P}\left(M_\tau > 1/\delta\right) \leq \delta\,\mathbb{E}[M_\tau] \leq \delta,$$
we have
$$\mathbb{P}\left(\Vert \eta_{1:\tau}\Vert_{\left((K_\tau+\epsilon I)^{-1} + I\right)^{-1}}^2 > 2\sigma^2\log\left(\frac{\det\left((1 + \epsilon)I + K_\tau\right)^{1/2}}{\delta}\right)\right) \leq \delta.$$
Using the same stopping time construction as in the proof of Theorem 1, the inequality extends to hold simultaneously for all $t\geq 0,$ which concludes the proof.
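For concreteness, here is how both sides of the inequality in Theorem 2 can be evaluated on a single sample path. The RBF kernel, its bandwidth, $\epsilon,$ and the data below are arbitrary stand-ins chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

t, sigma, eps, delta = 50, 1.0, 0.01, 0.05
x = rng.standard_normal((t, 2))                      # inputs x_1, ..., x_t in R^2
eta = sigma * rng.standard_normal(t)                 # noise eta_{1:t}

# RBF kernel matrix K_t (any symmetric positive definite kernel would do).
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)

A = np.linalg.inv(np.linalg.inv(K + eps * np.eye(t)) + np.eye(t))   # ((K_t + eps I)^{-1} + I)^{-1}
lhs = eta @ A @ eta                                                 # ||eta_{1:t}||^2 in that norm
rhs = 2 * sigma ** 2 * (0.5 * np.linalg.slogdet((1 + eps) * np.eye(t) + K)[1] + np.log(1 / delta))
print(f"lhs = {lhs:.3f}, rhs = {rhs:.3f}")   # with probability >= 1 - delta, lhs <= rhs
```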
Remark.
- If $K_ t$ is positive definite, one can let $\epsilon=0.$
- An alternative view of the self-normalized process in Theorem 2 is based on the feature space induced by the kernel. Let $\mathcal{H}$ be the RKHS with $k$ as its reproducing kernel, and let $\psi:\mathbb{R}^d\to\mathcal{H}$ be the associated feature map, i.e. $\langle\psi(x),\psi(y)\rangle_\mathcal{H} = k(x,y)$ for $x,y\in\mathbb{R}^d.$ Define
$$S_t=\sum_{s=1}^t \eta_s\psi(x_s),\ \ V_t = \mathcal{I} + \sum_{s=1}^t\psi(x_s)\psi(x_s)^\top,$$
where $V_t:\mathcal{H}\to\mathcal{H}$ is a linear mapping such that for $z\in\mathcal{H},$ $V_ t(z) = z + \sum_ {s=1}^t\psi(x_ s)\langle \psi(x_ s), z\rangle_ \mathcal{H}.$ Then, whenever $K_ t$ is positive definite, we have
$$\Vert \eta_{1:t}\Vert_{(I + K_t^{-1})^{-1}} = \Vert V_t^{-1/2}S_t\Vert_\mathcal{H}.$$
Proof. Denote $\Psi = (\psi(x_ 1),\cdots,\psi(x_ t))^\top;$ then $S_t = \Psi^\top\eta_ {1:t},\ V_t = \mathcal{I} + \Psi^\top\Psi,\ K_t = \Psi\Psi^\top.$ Using the push-through identity $\Psi(\mathcal{I}+\Psi^\top\Psi)^{-1}\Psi^\top = \Psi\Psi^\top(I+\Psi\Psi^\top)^{-1}$ in the fourth equality below, we have
$$\begin{align*} \Vert V_t^{-1/2}S_t\Vert_\mathcal{H}^2 &= S_t^\top V_t^{-1} S_t\\ &= \eta_{1:t}^\top\Psi V_t^{-1}\Psi^\top\eta_{1:t}\\ &= \eta_{1:t}^\top\Psi(I+\Psi^\top\Psi)^{-1}\Psi^\top\eta_{1:t}\\ &= \eta_{1:t}^\top\Psi\Psi^\top(I+\Psi\Psi^\top)^{-1}\eta_{1:t}\\ &= \eta_{1:t}^\top K_t(I+K_t)^{-1}\eta_{1:t}\\ &= \eta_{1:t}^\top (I + K_t^{-1})^{-1} \eta_{1:t}. \end{align*}$$
Thus we conclude the proof.
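This identity is easy to check numerically with an explicit finite-dimensional feature map standing in for $\psi$ (a random matrix below; the sizes are arbitrary, with $t$ no larger than the feature dimension so that $K_t$ is positive definite):

```python
import numpy as np

rng = np.random.default_rng(4)

t, p = 6, 10                                  # number of points and feature dimension (arbitrary)
Psi = rng.standard_normal((t, p))             # rows play the role of psi(x_1), ..., psi(x_t)
eta = rng.standard_normal(t)                  # eta_{1:t}

K = Psi @ Psi.T                               # Gram matrix K_t = Psi Psi^T (positive definite a.s.)
S = Psi.T @ eta                               # S_t = Psi^T eta_{1:t}
V = np.eye(p) + Psi.T @ Psi                   # V_t = I + Psi^T Psi

lhs = eta @ np.linalg.inv(np.eye(t) + np.linalg.inv(K)) @ eta   # ||eta_{1:t}||^2_{(I + K_t^{-1})^{-1}}
rhs = S @ np.linalg.solve(V, S)                                 # ||V_t^{-1/2} S_t||_H^2
print(lhs, rhs)                               # the two values agree up to round-off
```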
References
- Abbasi-Yadkori, Y., Pál, D. and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems.
- Chowdhury, S. R. and Gopalan, A. (2017). On kernelized multi-armed bandits. arXiv preprint arXiv:1704.00445.