Pessimism in Offline RL
Setting. Consider an episodic MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},H,\mathcal{P},r)$ with state space $\mathcal{S},$ action space $\mathcal{A},$ horizon $H,$ transition kernel $\mathcal{P}=\lbrace\mathcal{P}_ h\rbrace_ {h=1}^H,$ and reward function $r=\lbrace r_ h:\mathcal{S}\times\mathcal{A}\to [0,1]\rbrace_ {h=1}^H.$ For any policy $\pi=\lbrace \pi_h\rbrace_ {h=1}^H,$ define the value function $V_ h^\pi:\mathcal{S}\to\mathbb{R}$ and the Q-value function $Q_ h^\pi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ at each step $h\in[H]$ as
$$\begin{align*} V_h^\pi(x) &= \mathbb{E}_\pi\left[\left.\sum_{t=h}^H r_t(s_t,a_t)\right|s_h=x\right],\\ Q_h^\pi(x,a) &= \mathbb{E}_\pi\left[\left.\sum_{t=h}^H r_t(s_t,a_t)\right|s_h=x,a_h=a\right], \end{align*}$$
where $\mathbb{E}_ \pi$ is taken with respect to the probability measure of the trajectory $\tau=\lbrace(s_ h,a_ h)\rbrace_ {h=1}^H$ induced by $\pi,$ i.e. at each step $h\in[H],$
$$a_h\sim\pi_h(\cdot\vert s_h),\ s_{h+1}\sim\mathcal{P}_h(\cdot|s_h,a_h).$$
For any function $V:\mathcal{S}\to\mathbb{R},$ define the transition operator $\mathbb{P}_ h$ and the Bellman operator $\mathbb{B}_ h$ at each step $h\in[H]$ as follows:
$$\begin{align*} (\mathbb{P}_h V)(x,a) &:= \mathbb{E}_{s'\sim\mathcal{P}_h(\cdot|x,a)} V(s'),\\ (\mathbb{B}_h V)(x,a) &:= r_h(x,a) + (\mathbb{P}_h V)(x,a). \end{align*}$$
Let $\pi^{* }$ be the optimal policy in this MDP and $\lbrace Q^{* }_ h\rbrace_ {h=1}^H, \lbrace V^{* }_ h\rbrace_ {h=1}^H$ the corresponding value and Q-value functions. Then we have the Bellman optimality equation
$$V_h^*(x) = \max_{a\in\mathcal{A}}Q_h^*(x,a),\ Q_h^*(x,a) = (\mathbb{B}_hV_{h+1}^*)(x,a).$$
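To make these definitions concrete, here is a minimal tabular sketch that computes $Q_ h^{* }$ and $V_ h^{* }$ by backward induction via the operators $\mathbb{P}_ h,\mathbb{B}_ h$ and the Bellman optimality equation. The arrays `P` and `r` are hypothetical inputs used only for illustration.

```python
import numpy as np

def optimal_values(P, r):
    """Backward induction for a finite-horizon tabular MDP.

    P: array of shape (H, S, A, S), P[h, s, a, s'] = P_h(s' | s, a)
    r: array of shape (H, S, A),    r[h, s, a]     = r_h(s, a) in [0, 1]
    Returns Q of shape (H, S, A) and V of shape (H + 1, S) with V[H] = 0.
    """
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))          # convention: V_{H+1}^* = 0
    for h in reversed(range(H)):
        # (B_h V_{h+1})(s, a) = r_h(s, a) + E_{s' ~ P_h(.|s,a)} V_{h+1}(s')
        Q[h] = r[h] + P[h] @ V[h + 1]
        V[h] = Q[h].max(axis=1)       # Bellman optimality: V_h^* = max_a Q_h^*
    return Q, V
```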
Goal. We aim to learn a policy that maximizes the expected cumulative reward. Given a policy $\pi,$ define its suboptimality as
$$\mathrm{SubOpt}(\pi;x) = V_1^*(x) - V_1^\pi(x),$$
which is the performance metric in the following discussion.
Offline Data Collection
In the offline setting, the agent only has access to a dataset $\mathcal{D}$ of $K$ trajectories $\left\lbrace(x_ h^\tau, a_ h^\tau, r_ h^\tau)\right\rbrace_ {\tau=1,h=1}^{K,H}$ collected a priori by an experimenter. We assume the following condition on the data-collecting process.
Assumption 1. (Compliance). The dataset $\mathcal{D}$ that the learner has access to is compliant with the underlying MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},H,\mathcal{P},r).$ That is,
$$\begin{align*} \mathbb{P}_\mathcal{D}\left(r_h^\tau = r',x_{h+1}^\tau=x'|\lbrace(x_h^j,a_h^j)\rbrace_{j=1}^\tau, \lbrace(r_h^j,x_{h+1}^j)\rbrace_{j=1}^{\tau-1}\right)\\ = \mathbb{P}_\mathcal{M}\left(r_h(s_h,a_h)=r',s_{h+1}=x'|s_h=x_h^\tau,a_h=a_h^\tau\right), \end{align*}$$
for all $r'\in[0,1],x'\in\mathcal{S},h\in[H],\tau\in[K],$ where $\mathbb{P}_ \mathcal{D}$ is the joint distribution of the data-collecting process and $\mathbb{P}_ \mathcal{M}$ that of the underlying MDP.
Remark. The compliance assumption ensures the Markov property of the data-collecting process: $(r_ h^\tau, x_ {h+1}^\tau)$ depends on $\lbrace(x_ h^j,a_ h^j)\rbrace_ {j=1}^\tau, \lbrace(r_ h^j,x_ {h+1}^j)\rbrace_ {j=1}^{\tau-1}$ only through $(x_ h^\tau, a_ h^\tau).$ Moreover, the rewards and next states are generated by the reward function and transition kernel of the underlying MDP.
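As an illustration of a compliant data-collecting process, the sketch below rolls out a behavior policy in a tabular MDP represented as in the previous sketch; the array `mu` and the fixed initial state are hypothetical choices for illustration. Each $(r_ h^\tau, x_ {h+1}^\tau)$ is drawn from the true reward function and transition kernel given only $(x_ h^\tau, a_ h^\tau),$ which is exactly what Assumption 1 requires.

```python
import numpy as np

def collect_offline_data(P, r, mu, K, x1, rng=None):
    """Roll out a behavior policy mu to build a compliant offline dataset.

    mu: array of shape (H, S, A), mu[h, s, a] = probability of action a at (h, s)
    K:  number of trajectories; x1: fixed initial state.
    Returns a list of K trajectories, each a list of (x_h, a_h, r_h) tuples.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, S, A, _ = P.shape
    data = []
    for _ in range(K):
        traj, x = [], x1
        for h in range(H):
            a = rng.choice(A, p=mu[h, x])     # a_h ~ mu_h(. | x_h)
            traj.append((x, a, r[h, x, a]))   # reward from the true r_h (deterministic here)
            x = rng.choice(S, p=P[h, x, a])   # x_{h+1} ~ P_h(. | x_h, a_h)
        data.append(traj)
    return data
```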
Decomposition of Suboptimality
Consider a meta-algorithm that yields estimated value and Q-value functions $\widehat{V}_ h:\mathcal{S}\to\mathbb{R}$ and $\widehat{Q}_ h:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ based on the dataset $\mathcal{D}.$ We denote by $\widehat{\pi}$ the policy corresponding to $\widehat{V}_ h$ and $\widehat{Q}_ h$ in the sense that $\widehat{V}_ h(x) = \langle\widehat{Q}_ h(x,\cdot),\widehat{\pi}_ h(\cdot\vert x)\rangle_ \mathcal{A}.$ Define the model evaluation error at each step $h\in[H]$ as
$$\iota_h(x,a)=(\mathbb{B}_h\widehat{V}_{h+1})(x,a)-\widehat{Q}_h(x,a).$$
For any policy $\pi=\lbrace\pi_ h\rbrace_ {h=1}^H,$ we have
$$\begin{align*} &\sum_{h=1}^H\mathbb{E}_\pi[\iota_h(s_h,a_h)|s_1=x]\\ &= \sum_{h=1}^H\mathbb{E}_\pi[r_h(s_h,a_h) + (\mathbb{P}_h \widehat{V}_{h+1})(s_h,a_h) - \widehat{Q}_h(s_h,a_h)|s_1=x]\\ &= \sum_{h=1}^H\mathbb{E}_\pi[r_h(s_h,a_h) + \widehat{V}_{h+1}(s_{h+1}) - \widehat{Q}_h(s_h,a_h)|s_1=x]\\ &= \sum_{h=1}^H\mathbb{E}_\pi[r_h(s_h,a_h) + \langle\widehat{Q}_{h+1}(s_{h+1},\cdot),\widehat{\pi}(\cdot|s_{h+1})\rangle_\mathcal{A} - \langle\widehat{Q}_h(s_h,\cdot),\pi(\cdot|s_h)\rangle_\mathcal{A}|s_1=x], \end{align*}$$
Note that $\widehat{Q}_ {H+1}=0,$ so the sum telescopes: we can shift the index of the first inner product and rearrange the last two terms inside the expectation. Hence
$$\begin{align*} &\sum_{h=1}^H\mathbb{E}_{\pi}[\iota_h(s_h,a_h)|s_1=x] \\ &= \sum_{h=1}^H\mathbb{E}_\pi[r_h(s_h,a_h)|s_1=x] - \langle\widehat{Q}_{1}(x,\cdot),\widehat{\pi}(\cdot|x)\rangle_\mathcal{A}\\ &\qquad+ \sum_{h=1}^{H} \mathbb{E}_\pi[\langle\widehat{Q}_{h}(s_{h},\cdot),\widehat{\pi}(\cdot|s_{h}) - \pi(\cdot|s_{h})\rangle_\mathcal{A}|s_1=x]\\ &= V_1^\pi(x) - \widehat{V}_1(x) + \sum_{h=1}^{H} \mathbb{E}_\pi[\langle\widehat{Q}_{h}(s_{h},\cdot),\widehat{\pi}(\cdot|s_{h}) - \pi(\cdot|s_{h})\rangle_\mathcal{A}|s_1=x]. \end{align*}$$
With the equation above, we can easily verify the following conclusion:
Lemma 1. Let $\widehat{\pi}=\lbrace\widehat{\pi}_ h\rbrace_ {h=1}^H$ be the policy such that $\widehat{V}_ h(x) = \langle \widehat{Q}_ h(x,\cdot),\widehat{\pi}_ h(\cdot|x)\rangle_ \mathcal{A}.$ Then for any $x\in\mathcal{S},$ the suboptimality of $\widehat{\pi}$ admits the following decomposition:
$$\begin{align*} \mathrm{SubOpt}(\widehat{\pi};x) = &\underset{\text{(i) Spurious Correlation}}{\underbrace{-\sum_{h=1}^H\mathbb{E}_{\widehat{\pi}}[\iota_h(s_h,a_h)|s_1=x]}} + \underset{\text{(ii) Intrinsic Uncertainty}}{\underbrace{\sum_{h=1}^H\mathbb{E}_{\pi^*}[\iota_h(s_h,a_h)|s_1=x]}}\\ &+ \underset{\text{(iii) Optimization Error}}{\underbrace{\sum_{h=1}^{H}\mathbb{E}_{\pi^*}[\langle\widehat{Q}_{h}(s_{h},\cdot),{\pi^*}(\cdot|s_{h}) - \widehat{\pi}(\cdot|s_{h})\rangle_\mathcal{A}|s_1=x]}}. \end{align*}$$
Remark. In this decomposition,
- Term (i) is difficult to control, since $\widehat{\pi}$ and $\iota_h$ simultaneously depend on $\mathcal{D}.$
- Term (ii) is easier to control, because $\pi^*$ is intrinsic to the underlying MDP and does not depend on $\mathcal{D}.$
- Term (iii) is nonpositive, because $\widehat{\pi}$ is greedy with respect to $\widehat{Q}_ h,$ i.e. $\widehat{\pi}_ h(\cdot\vert x) = \underset{\pi_ h}{\mathrm{argmax}}\langle\widehat{Q}_ h(x,\cdot),\pi_ h(\cdot\vert x)\rangle_ \mathcal{A}.$
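Since Lemma 1 is an exact identity, it can be sanity-checked numerically. The sketch below is a minimal illustration: the random tabular instance and helper names such as `pathwise_sum` are assumptions, not part of the original development. It draws a small MDP, picks an arbitrary estimate $\widehat{Q}_ h$ together with its greedy policy $\widehat{\pi},$ and checks that the three terms sum exactly to the suboptimality.

```python
import numpy as np

def policy_value(P, r, pi):
    """V_h^pi for a stochastic policy pi of shape (H, S, A)."""
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        Q = r[h] + P[h] @ V[h + 1]                 # (B_h V_{h+1})(s, a)
        V[h] = (pi[h] * Q).sum(axis=1)
    return V

def pathwise_sum(P, pi, x1, f):
    """sum_h E_pi[f[h](s_h, a_h) | s_1 = x1] for f of shape (H, S, A)."""
    H, S, A, _ = P.shape
    d = np.zeros(S)
    d[x1] = 1.0                                    # distribution of s_h under pi
    total = 0.0
    for h in range(H):
        da = d[:, None] * pi[h]                    # joint distribution of (s_h, a_h)
        total += (da * f[h]).sum()
        d = np.einsum('sa,sap->p', da, P[h])       # push forward to s_{h+1}
    return total

# A small random instance and an arbitrary estimate Q_hat.
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
P = rng.dirichlet(np.ones(S), size=(H, S, A))      # P[h, s, a, :] sums to one
r = rng.uniform(size=(H, S, A))
Q_hat = rng.uniform(0.0, H, size=(H, S, A))

# pi_hat is greedy w.r.t. Q_hat; pi_star is the optimal policy (backward induction).
Q_star, V_star = np.zeros((H, S, A)), np.zeros((H + 1, S))
for h in reversed(range(H)):
    Q_star[h] = r[h] + P[h] @ V_star[h + 1]
    V_star[h] = Q_star[h].max(axis=1)
pi_hat, pi_star = np.zeros((H, S, A)), np.zeros((H, S, A))
for h in range(H):
    pi_hat[h, np.arange(S), Q_hat[h].argmax(axis=1)] = 1.0
    pi_star[h, np.arange(S), Q_star[h].argmax(axis=1)] = 1.0

# Model evaluation error iota_h and the inner-product term appearing in (iii).
V_hat = np.vstack([(pi_hat[h] * Q_hat[h]).sum(axis=1) for h in range(H)] + [np.zeros(S)])
iota = np.stack([r[h] + P[h] @ V_hat[h + 1] - Q_hat[h] for h in range(H)])
gap = np.stack([(Q_hat[h] * (pi_star[h] - pi_hat[h])).sum(axis=1) for h in range(H)])

x1 = 0
subopt = policy_value(P, r, pi_star)[0, x1] - policy_value(P, r, pi_hat)[0, x1]
decomp = (-pathwise_sum(P, pi_hat, x1, iota)                                           # term (i)
          + pathwise_sum(P, pi_star, x1, iota)                                         # term (ii)
          + pathwise_sum(P, pi_star, x1, np.broadcast_to(gap[:, :, None], (H, S, A)))) # term (iii)
assert np.isclose(subopt, decomp)                  # the identity holds exactly (up to float error)
```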
Pessimistic Value Iteration (PEVI)
Jin et al. (2021) proposed a pessimistic method for learning a policy in the offline setting. The algorithm, called Pessimistic Value Iteration (PEVI), flips the sign of the bonus function used to promote exploration in online RL: instead of adding the bonus to the estimated Q-value, it subtracts it.
The algorithm constructs estimated functions $\widehat{V}_ h,\widehat{Q}_ h$ and an estimated Bellman operator $\widehat{\mathbb{B}}_ h.$ The operator $\widehat{\mathbb{B}}_ h$ can remain implicit, since the algorithm only uses $\widehat{\mathbb{B}}_ h\widehat{V}_ {h+1}.$ To eliminate the spurious correlation from the suboptimality, an uncertainty quantifier with confidence parameter $\xi\in(0,1)$ is defined:
Definition 1 ($\xi$-uncertainty quantifier). $\lbrace\Gamma_h:\mathcal{S}\times\mathcal{A}\to\mathbb{R}\rbrace_ {h=1}^H$ is called a $\xi$-uncertainty quantifier with respect to $\mathbb{P}_\mathcal{D},$ if the event
$$\mathcal{E} = \left\lbrace\left\vert(\widehat{\mathbb{B}}_h\widehat{V}_{h+1})(x,a) - (\mathbb{B}_h\widehat{V}_{h+1})(x,a)\right\vert\leq \Gamma_h(x,a)\ \forall (x,a)\in\mathcal{S}\times\mathcal{A},h\in[H]\right\rbrace$$
satisfies $\mathbb{P}_ \mathcal{D}(\mathcal{E}) \geq 1-\xi.$
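For example, in the linear MDP setting analyzed by Jin et al. (2021), $\Gamma_ h$ is an elliptical bonus $\Gamma_ h(x,a)=\beta\,\Vert\phi(x,a)\Vert_ {\Lambda_h^{-1}}$ built from the features of the transitions observed at step $h,$ with a confidence width $\beta$ on the order of $dH\sqrt{\log(dHK/\xi)}.$ A minimal sketch of such a bonus is given below; the feature map and the exact choice of $\beta$ are left as inputs and are not specified here.

```python
import numpy as np

def elliptical_bonus(phi_data, phi_query, beta, lam=1.0):
    """Gamma_h(x, a) = beta * sqrt(phi(x, a)^T Lambda_h^{-1} phi(x, a)).

    phi_data:  (K, d) array, features phi(x_h^tau, a_h^tau) of the step-h transitions
    phi_query: (..., d) array, features of the (x, a) pairs to evaluate
    beta:      confidence width; lam: ridge parameter of Lambda_h
    """
    d = phi_data.shape[1]
    Lambda = phi_data.T @ phi_data + lam * np.eye(d)   # Lambda_h = sum_tau phi phi^T + lam I
    quad = np.einsum('...i,ij,...j->...', phi_query, np.linalg.inv(Lambda), phi_query)
    return beta * np.sqrt(quad)
```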
Based on this definition, a meta-algorithm for general MDPs is given below.
- Initialize $\widehat{V}_{H+1}(\cdot)=0.$
- For $h=H,H-1,\cdots,1$ do
- Estimation: Construct $(\widehat{\mathbb{B}}_ h\widehat{V}_ {h+1})(\cdot,\cdot)$ and $\Gamma_ h(\cdot,\cdot)$ based on $\mathcal{D}.$
- Pessimism: $\overline{Q}_ h(\cdot,\cdot)\gets (\widehat{\mathbb{B}}_ h\widehat{V}_ {h+1})(\cdot,\cdot) - \Gamma_ h(\cdot,\cdot).$
- Truncation: $\widehat{Q}_ h(\cdot,\cdot) \gets \min\left\lbrace \overline{Q}_ h(\cdot,\cdot), H-h+1 \right\rbrace^+.$
- Optimization: $\widehat{\pi}_ h(\cdot\vert\cdot)\gets\mathrm{argmax}_ {\pi_h}\langle \widehat{Q}_ h(\cdot,\cdot),\pi_h(\cdot\vert\cdot)\rangle_ {\mathcal{A}}.$
- Evaluation: $\widehat{V}_ h(\cdot) \gets \langle\widehat{Q}_ h(\cdot,\cdot),\widehat{\pi}_ h(\cdot\vert\cdot)\rangle_ \mathcal{A}.$
- Output $\mathtt{Pess}(\mathcal{D}) = \lbrace\widehat{\pi}_ h\rbrace_ {h=1}^H.$
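A minimal tabular instantiation of this meta-algorithm is sketched below, assuming a dataset in the format of the earlier collection sketch. The empirical Bellman backup used as $\widehat{\mathbb{B}}_ h\widehat{V}_ {h+1}$ and the count-based bonus standing in for $\Gamma_ h$ are illustrative assumptions; a concrete instantiation would supply its own estimator and a genuine $\xi$-uncertainty quantifier.

```python
import numpy as np

def pevi_tabular(data, S, A, beta=1.0):
    """Tabular sketch of the PEVI meta-algorithm with a count-based bonus.

    data: list of trajectories, each a list of (x_h, a_h, r_h) tuples for h = 0..H-1.
    Returns a deterministic greedy policy pi_hat of shape (H, S).
    """
    H = len(data[0])
    n = np.zeros((H, S, A))                        # visit counts N_h(s, a)
    r_sum = np.zeros((H, S, A))                    # accumulated rewards
    p_cnt = np.zeros((H, S, A, S))                 # next-state counts
    for traj in data:
        for h, (x, a, rew) in enumerate(traj):
            n[h, x, a] += 1
            r_sum[h, x, a] += rew
            if h + 1 < H:
                p_cnt[h, x, a, traj[h + 1][0]] += 1

    V_hat = np.zeros((H + 1, S))
    pi_hat = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        n_h = np.maximum(n[h], 1)                  # avoid division by zero
        # Estimation: empirical Bellman backup (B_hat_h V_hat_{h+1})(s, a)
        B_hat = r_sum[h] / n_h + (p_cnt[h] / n_h[:, :, None]) @ V_hat[h + 1]
        # Count-based bonus used in place of a xi-uncertainty quantifier Gamma_h (an assumption)
        Gamma = beta * (H - h) / np.sqrt(n_h)
        Q_bar = B_hat - Gamma                      # Pessimism
        Q_hat = np.clip(Q_bar, 0.0, H - h)         # Truncation: clip to [0, H - h + 1] (1-based)
        pi_hat[h] = Q_hat.argmax(axis=1)           # Optimization (greedy)
        V_hat[h] = Q_hat.max(axis=1)               # Evaluation
    return pi_hat
```

For example, combined with the earlier collection sketch, `pevi_tabular(collect_offline_data(P, r, mu, K=500, x1=0), S, A)` returns a pessimistic greedy policy for that tabular instance.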
If $\lbrace\Gamma_h\rbrace_ {h=1}^H$ produced by PEVI is a $\xi$-uncertainty quantifier, then the suboptimality of the output policy $\mathtt{Pess}(\mathcal{D})$ can be bounded with high probability. The key step is the following lemma.
Lemma 2 (Pessimism for general MDP). Suppose that $\lbrace\Gamma_h\rbrace_ {h=1}^H$ produced by PEVI is a $\xi$-uncertainty quantifier. If $\mathcal{E}$ holds, then
$$0\leq\iota_h(x,a)\leq 2\Gamma_h(x,a)\ \ \forall (x,a)\in\mathcal{S}\times\mathcal{A},h\in[H].$$
Proof. The first inequality: In the truncation step,
$$\widehat{Q}_h(x,a) = \min\left\lbrace \overline{Q}_h(x,a), H-h+1 \right\rbrace^+\leq \max\lbrace 0,\overline{Q}_h(x,a)\rbrace,$$
then
$$\begin{align*} \iota_h(x,a) &= (\mathbb{B}_h\widehat{V}_{h+1})(x,a) - \widehat{Q}_h(x,a) \\ &\geq \min\left\lbrace (\mathbb{B}_h\widehat{V}_{h+1})(x,a), (\mathbb{B}_h\widehat{V}_{h+1})(x,a) - \overline{Q}_h(x,a) \right\rbrace\\ &\text{(Plug in the pessimism step)}\\ &= \min\left\lbrace (\mathbb{B}_h\widehat{V}_{h+1})(x,a), (\mathbb{B}_h\widehat{V}_{h+1})(x,a) - (\widehat{\mathbb{B}}_h\widehat{V}_{h+1})(x,a) + \Gamma_h(x,a) \right\rbrace\\ &\geq 0. \end{align*}$$
The last inequality holds because (i) $r_h\geq 0$ and $\widehat{V}_ {h+1}$ is nonnegative by construction, so $(\mathbb{B}_ h\widehat{V}_ {h+1})(x,a)\geq 0,$ and (ii) on the event $\mathcal{E},$ $(\mathbb{B}_ h\widehat{V}_ {h+1})(x,a) - (\widehat{\mathbb{B}}_ h\widehat{V}_ {h+1})(x,a) \geq -\Gamma_ h(x,a).$
The second inequality: In the pessimism step,
$$\begin{align*} \overline{Q}_h(x,a) &= (\widehat{\mathbb{B}}_ h\widehat{V}_{h+1})(x,a) - \Gamma_ h(x,a)\\ &\leq (\mathbb{B}_h\widehat{V}_{h+1})(x,a)\\ &\leq 1 + \sup_{x'\in\mathcal{S}}\widehat{V}_{h+1}(x')\\ &\leq 1 + (H-h) = H-h+1, \end{align*}$$
where the first inequality uses the event $\mathcal{E},$ the second uses $r_h\leq 1,$ and the third follows from the truncation step at step $h+1$ (together with $\widehat{V}_ {H+1}=0$).
Then
$$\widehat{Q}_h(x,a) = \min\left\lbrace \overline{Q}_h(x,a), H-h+1 \right\rbrace^+\geq \overline{Q}_h(x,a),$$
and
$$\begin{align*} \iota_h(x,a) &= (\mathbb{B}_h\widehat{V}_{h+1})(x,a) - \widehat{Q}_h(x,a) \\ &\leq (\mathbb{B}_h\widehat{V}_{h+1})(x,a) - \overline{Q}_h(x,a) \\ &= (\mathbb{B}_h\widehat{V}_{h+1})(x,a) - (\widehat{\mathbb{B}}_h\widehat{V}_{h+1})(x,a) + \Gamma_h(x,a)\\ &\leq 2\Gamma_h(x,a). \end{align*}$$
In summary,
$$0\leq\iota_h(x,a)\leq 2\Gamma_h(x,a)\ \ \forall (x,a)\in\mathcal{S}\times\mathcal{A},h\in[H].$$
Theorem 1 (Suboptimality for general MDP). Suppose that $\lbrace\Gamma_h\rbrace_ {h=1}^H$ produced by PEVI is a $\xi$-uncertainty quantifier. If $\mathcal{E}$ holds, then
$$\mathrm{SubOpt}(\mathtt{Pess}(\mathcal{D});x) \leq 2\sum_{h=1}^H\mathbb{E}_{\pi^*}[\Gamma_h(s_h,a_h)|s_1=x].$$
The proof is straightforward: write the decomposition given by Lemma 1 and bound terms (i) and (ii) with Lemma 2. The $\xi$-uncertainty quantifier eliminates the spurious correlation in term (i), and the remaining upper bound comes solely from the intrinsic uncertainty in term (ii).
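Explicitly, on the event $\mathcal{E},$
$$\begin{align*} \mathrm{SubOpt}(\mathtt{Pess}(\mathcal{D});x) &= \underset{\leq 0\text{ by Lemma 2}}{\underbrace{-\sum_{h=1}^H\mathbb{E}_{\widehat{\pi}}[\iota_h(s_h,a_h)|s_1=x]}} + \sum_{h=1}^H\mathbb{E}_{\pi^*}[\iota_h(s_h,a_h)|s_1=x] + \underset{\leq 0\text{ by greediness}}{\underbrace{\text{(iii)}}}\\ &\leq \sum_{h=1}^H\mathbb{E}_{\pi^*}[\iota_h(s_h,a_h)|s_1=x] \leq 2\sum_{h=1}^H\mathbb{E}_{\pi^*}[\Gamma_h(s_h,a_h)|s_1=x], \end{align*}$$
where the last inequality applies Lemma 2 once more, this time to term (ii).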
References
Jin, Y., Yang, Z., and Wang, Z. (2021). Is Pessimism Provably Efficient for Offline RL? In International Conference on Machine Learning (pp. 5084-5096). PMLR.