Contextual Bandits with Confounding Bias and Missing Observations
Offline Policy Learning in Confounded Contextual Bandit (CCB)
In a confounded contextual bandit, each trial can be represented by a 5-tuple $(U,X,A,Y,O),$ where
- $U\in\mathcal{U}$ is the unobserved confounder,
- $X\in\mathcal{X}$ is the context,
- $A\in\mathcal{A}$ is the treatment or action,
- $Y\in\mathbb{R}$ is the reward, and
- $O\in\mathcal{O}$ is the side observation.
Offline data collection
The collection of the offline dataset $\mathcal{D}$ is described by the following observational process.
- In the $t^\text{th}$ trial, the environment samples $(u_ t,x_ t,o_ t)$ as a realization of $(U,X,O)$ according to the prior $p(u,x,o).$
- The agent then conducts a treatment according to the observational policy $\pi_ \mathrm{obs}:\mathcal{X}\times\mathcal{O}\times\mathcal{U}\to\Delta(\mathcal{A}).$ This policy is confounded by $U,$ which can be viewed as a predilection for certain treatments due to hidden causes encoded by $U.$
- The agent receives a reward $y_ t$ sampled from $p(y\vert u,x,o,a),$ conditioned on $(U,X,O,A).$ The joint distribution in the observational process is given by
$$p_\mathrm{obs}(u,x,o,a,y)=p(u,x,o)\pi_\mathrm{obs}(a|u,x,o)p(y|u,x,o,a).$$
With $T$ trials in the observational process, the full dataset $\tilde{\mathcal{D}} = \lbrace(u_ t,x_ t,o_ t,a_ t, y_ t)\rbrace_ {t=1}^T$ is not available, because the confounders $u_ t$ are unmeasured and both the contexts $x_ t$ and the side observations $o_ t$ may be missing.
Missing Mechanism. Denote the collected dataset as $\mathcal{D}=\lbrace(\check{x}_ t, a_ t, y_ t, \check{o}_ t)\rbrace_ {t=1}^T,$ where $\check{x}_ t$ and $\check{o}_ t$ encode the missingness. Let $R_ X$ and $R_ O$ denote the missing indicators for $X$ and $O,$ respectively, and $r_ {X,t},r_ {O,t}$ their realizations in the $t^\text{th}$ trial. With a dummy value $\mathrm{None}$ introduced to represent a missing record, we have
$$\check{x}_t=\begin{cases} x_t,\ \text{if}\ r_{X,t}=1\\ \mathrm{None},\ \text{if}\ r_{X,t}=0 \end{cases}\qquad \check{o}_t=\begin{cases} o_t,\ \text{if}\ r_{O,t}=1\\ \mathrm{None},\ \text{if}\ r_{O,t}=0. \end{cases}$$
The missingness of the context $X$ and the side observation $O$ is assumed to be not at random but outcome-independent. That is, $R_ X$ and $R_ O$ may depend on $(U,X,O,A)$ but are independent of $Y.$
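For concreteness, below is a minimal Python sketch (with illustrative field names, not from the paper) of how a collected record and the missing mechanism might be encoded, using `None` as the dummy value for missing entries.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    """One collected trial (x_check, a, y, o_check); None encodes a missing entry."""
    x: Optional[float]   # x_check = x if r_X = 1, else None
    a: int               # treatment, always recorded
    y: float             # reward, always recorded
    o: Optional[float]   # o_check = o if r_O = 1, else None

def censor(x, a, y, o, r_x: int, r_o: int) -> Record:
    """Apply the missing mechanism to a fully observed trial (u is never recorded)."""
    return Record(x=x if r_x == 1 else None, a=a, y=y, o=o if r_o == 1 else None)

# example: context observed, side observation missing
print(censor(x=0.7, a=1, y=2.3, o=5.0, r_x=1, r_o=0))
# Record(x=0.7, a=1, y=2.3, o=None)
```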
Interventional Process
In the interventional process, an interventional policy is carried out after the model is learned from the offline dataset. The interventional process is different from the observational process in the following three aspects:
- side observations $O$ are unmeasurable while context $X$ is fully measurable;
- the agent adopts an interventional policy $\pi:\mathcal{X}\to\Delta(\mathcal{A})$ independent of unmeasurable $U$ and $O;$
- context $X$ follows a new marginal distribution $\tilde{p}(x).$
Then the joint distribution $p^\pi_\mathrm{in}$ of random variables in the interventional process is given by
$$p^\pi_\mathrm{in}(u,x,o,a,y) = \tilde{p}(x)p(u,o|x)\pi(a|x)p(y|u,x,o,a).$$
The underlying DAGs of observational and interventional processes are presented below.

Goal. In the interventional process, the average reward $v^\pi$ is defined as
$$v^\pi = \mathbb{E}_{p^\pi_\mathrm{in}}[Y],$$
where $\mathbb{E}_ {p^\pi_ \mathrm{in}}$ denotes the expectation taken with respect to $p^\pi_ \mathrm{in}.$ The goal of the agent is to find an optimal policy in the class $\Pi$ of policies $\pi:\mathcal{X}\to\Delta(\mathcal{A})$ that maximizes the average reward:
$$\pi^* = \underset{\pi\in\Pi}{\mathrm{argmax}}\ v^\pi.$$
The performance metric of a policy $\pi$ is defined as its suboptimality:
$$\mathrm{SubOpt}(\pi) = v^{\pi^*} - v^\pi.$$
Causal-Adjusted Pessimistic (CAP) Policy Learning
Average reward evaluation
Define the conditional average treatment effect (CATE) as
$$g^*(a,x)=\mathbb{E}_\mathrm{obs}[Y|X=x,\mathrm{do}(A=a)],$$
then the average reward can be evaluated via
$$v^\pi = \mathbb{E}_{p_\mathrm{in}^\pi}[g^*(A,X)].$$
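As a numerical check on this identity, the toy tabular sketch below (all distributions are made-up small tables, not from the paper) computes $v^\pi$ once by marginalizing the full interventional joint $p^\pi_ \mathrm{in}$ and once as $\mathbb{E}_ {p_ \mathrm{in}^\pi}[g^{* }(A,X)]$ with $g^{* }(a,x)=\sum_ {u,o}p(u,o\vert x)\,\mathbb{E}[Y\vert u,x,o,a];$ the two values agree.

```python
import numpy as np

rng = np.random.default_rng(0)
nU, nX, nO, nA = 2, 3, 2, 2

# toy prior p(u, x, o) and mean-reward table E[Y | u, x, o, a]
p_uxo = rng.random((nU, nX, nO)); p_uxo /= p_uxo.sum()
y_mean = rng.random((nU, nX, nO, nA))

# interventional ingredients: new context marginal p~(x) and policy pi(a | x)
p_tilde_x = rng.random(nX); p_tilde_x /= p_tilde_x.sum()
pi = rng.random((nX, nA)); pi /= pi.sum(axis=1, keepdims=True)

# p(u, o | x), shared by the observational and interventional processes
p_x = p_uxo.sum(axis=(0, 2))                  # p(x)
p_uo_given_x = p_uxo / p_x[None, :, None]     # p(u, o | x)

# (1) v^pi by marginalizing the full joint p_in(u, x, o, a, y)
v_joint = np.einsum('x,uxo,xa,uxoa->', p_tilde_x, p_uo_given_x, pi, y_mean)

# (2) v^pi via the CATE g*(a, x) = sum_{u,o} p(u, o | x) E[Y | u, x, o, a]
g_star = np.einsum('uxo,uxoa->ax', p_uo_given_x, y_mean)
v_cate = np.einsum('x,xa,ax->', p_tilde_x, pi, g_star)

print(np.isclose(v_joint, v_cate))  # True
```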
Policy Optimization
Hereafter, let $g$ denote an estimate of the CATE and $g^{* }$ the exact CATE. The average reward function corresponding to $g$ and $\pi$ is defined as
$$v(g,\pi) = \mathbb{E}_{p_\mathrm{in}^\pi}[g(A,X)].$$
Then a greedy policy $\widehat{\pi}=\mathrm{argmax}_ {\pi\in\Pi}\,v(g,\pi)$ can be obtained. However, this policy often suffers from the spurious correlation between the estimation error of $g$ and $\widehat{\pi}.$ Uncertainty quantification and pessimism can be adopted to address this spurious correlation.
We first construct a confidence set $\mathrm{CI}_ \mathcal{D}$ to which $g^{* }$ belongs with high probability. By optimizing the policy against the most pessimistic $g\in\mathrm{CI}_ \mathcal{D},$ i.e., the $g$ that minimizes $v(g,\cdot),$ we ensure $\inf_ {g\in\mathrm{CI}_ \mathcal{D}}v(g,\widehat{\pi})\leq v(g^{* }, \widehat{\pi})$ and thus mitigate the spurious correlation. Hence, the estimated policy is given by
$$\widehat{\pi} = \underset{\pi\in\Pi}{\mathrm{argsup}}\inf_{g\in\mathrm{CI}_\mathcal{D}} v(g,\pi).$$
Causal-adjusted pessimistic (CAP) algorithm
Having identified the CATE from an integral equation system (IES), we can construct the confidence set $\mathrm{CI}_ \mathcal{D}$ from a hypothesis class $\mathcal{H}$ by minimizing an empirical loss function $\mathcal{L}_ \mathcal{D}:\mathcal{H}\to\mathbb{R}.$ An overview of the causal-adjusted pessimistic algorithm is presented below.
- Input: dataset $\mathcal{D}=\lbrace(\check{x}_ t,a_ t,y_ t,\check{o}_ t)\rbrace_ {t=1}^T,$ hypothesis class $\mathcal{H},$ policy class $\Pi,$ threshold $e_ \mathcal{D}.$
- Construct the confidence set $\mathrm{CI}_ \mathcal{D}(e_ \mathcal{D})$ as the level set of $\mathcal{H}$ with respect to metric $\mathcal{L}_ \mathcal{D}(\cdot)$ and threshold $e_ \mathcal{D}.$
- $\widehat{\pi} = \mathrm{argsup}_ {\pi\in\Pi}\inf_ {g\in\mathrm{CI}_ \mathcal{D}(e_ \mathcal{D})} v(g,\pi).$
- Output: $\widehat{\pi}$.
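Below is a minimal sketch of this procedure for finite hypothesis and policy classes. The names `H`, `Pi`, `loss_D`, `v_hat`, and `e_D` are placeholders for the hypothesis class, the policy class, the empirical loss $\mathcal{L}_ \mathcal{D},$ the average-reward estimate $v(g,\pi),$ and the threshold; the level-set and pessimism steps mirror the definitions given later in the estimation section.

```python
import numpy as np

def cap(H, Pi, loss_D, v_hat, e_D):
    """Causal-adjusted pessimistic policy learning over finite classes.

    H      : list of candidate CATE functions g (or bridge-function tuples)
    Pi     : list of candidate policies
    loss_D : callable g -> empirical IES loss L_D(g) on the dataset D
    v_hat  : callable (g, pi) -> estimated average reward v(g, pi)
    e_D    : confidence-set threshold
    """
    losses = np.array([loss_D(g) for g in H])
    # confidence set CI_D(e_D): level set of the empirical loss
    ci = [g for g, l in zip(H, losses) if l <= losses.min() + e_D]
    # pessimism: score each policy by its worst-case value over the confidence set
    pessimistic_values = [min(v_hat(g, pi) for g in ci) for pi in Pi]
    return Pi[int(np.argmax(pessimistic_values))]
```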
Causal Identification of CATE
CCB with Instrumental Variable (CCB-IV)
The model for CCB-IV is illustrated in the following figure.

Assumption 1. We assume that the following conditions hold for the observational process of the CCB-IV, where $Z\in\mathcal{Z}$ denotes the instrumental variable and $R_ Z$ its missing indicator.
- (Structural reward). $Y=f(A,X) + \epsilon$ where $\epsilon\perp A\ \vert\ (X,U);$
- (IV completeness). $\mathbb{E}_ \mathrm{obs}[\sigma(X,A)\vert Z=z, R_ Z=1]=0$ holds for all $z\in\mathcal{Z}$ if and only if $\sigma(X,A)=0$ holds almost surely;
- (IV independence). $Z\perp (U,\epsilon)\ \vert X;$
- (Unconfounded and outcome-independent missingness). We allow $R_ X$ to be caused by $(Z,X,A)$ and $R_Z$ to be caused by $(Z,X).$
Theorem 1. Suppose that Assumption 1 holds. If there exist functions $h_1:\mathcal{Y}\times\mathcal{A}\times\mathcal{Z}\to\mathbb{R}$ and $g:\mathcal{A}\times\mathcal{X}\to\mathbb{R}$ satisfying:
$$\begin{align*} &\mathbb{E}_\mathrm{obs}[h_1(Y,A,Z) - Y|Z,R_Z=1] = 0 \ \ \mathrm{a.s.},\\ &\mathbb{E}_\mathrm{obs}[g(A,X) - h_1(Y,A,Z)|Z,A,X,(R_Z,R_X)=\mathbf{1}] = 0\ \ \mathrm{a.s.}, \end{align*}$$
it follows that $g(A,X)\overset{\mathrm{a.s.}}{=} g^{* }(A,X)$ where $g^{* }$ is the CATE.
Proof. We first compute $\mathbb{E}_ \mathrm{obs}[Y\vert Z,R_ Z=1]:$
$$\begin{align*} \mathbb{E}_\mathrm{obs}[Y|Z,R_Z=1] &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[Y|Z,X]|Z,R_Z=1\right]\\ &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[f(A,X) + \epsilon |Z,X]|Z,R_Z=1\right]\\ &= \mathbb{E}_\mathrm{obs}\left[f(A,X) + \mathbb{E}_\mathrm{obs}[\epsilon |Z,X]|Z,R_Z=1\right]\\ &= \mathbb{E}_\mathrm{obs}\left[f(A,X) + \mathbb{E}_\mathrm{obs}[\epsilon|X]|Z,R_Z=1\right].\\ \end{align*}$$
The first equality follows from $Y\perp R_Z\vert Z,X,$ the second from the structural equation of $Y,$ the third from $A\perp R_Z\vert Z,X,$ and the fourth from $\epsilon\perp Z\vert X.$
On the other hand,
$$\begin{align*} \mathbb{E}_\mathrm{obs}[Y|Z,R_Z=1] &= \mathbb{E}_\mathrm{obs}[h_1(Y,A,Z)|Z,R_Z=1]\\ &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[h_1(Y,A,Z)|Z,A,X,(R_Z,R_X)=\mathbf{1}]|Z,R_Z=1\right]\\ &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[g(A,X)|Z,A,X,(R_Z,R_X)=\mathbf{1}]|Z,R_Z=1\right]\\ &= \mathbb{E}_\mathrm{obs}[g(A,X)|Z,R_Z=1]. \end{align*}$$
Hence $\mathbb{E}_ \mathrm{obs}\left[g(A,X) - f(A,X) - \mathbb{E}_ \mathrm{obs}[\epsilon\vert X]\vert Z,R_Z=1\right] = 0,$ and $g(A,X) = f(A,X) + \mathbb{E}_\mathrm{obs}[\epsilon\vert X]$ by IV completeness.
Now consider the CATE:
$$\begin{align*} g^*(a,x) &= \mathbb{E}_\mathrm{obs}[Y|X=x,\mathrm{do}(A=a)]\\ &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[Y|U,X=x,A=a]|X=x\right]\\ &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[f(A,X) + \epsilon|U,X=x,A=a]|X=x\right]\\ &= \mathbb{E}_\mathrm{obs}\left[f(a,x) + \mathbb{E}_\mathrm{obs}[\epsilon|U,X=x,A=a]|X=x\right]\\ &= f(a,x) + \mathbb{E}_\mathrm{obs}[\epsilon|X=x]. \end{align*}$$
Hence
$$g^*(A,X) = f(A,X) + \mathbb{E}_\mathrm{obs}[\epsilon|X] \overset{\mathrm{a.s.}}{=} g(A,X),$$
which concludes the proof.
Claim 1. A solution $h=(h_1, g)$ to the IES in Theorem 1 exists if and only if there exists a solution $h_1$ to the following equation:
$$\mathbb{E}_\mathrm{obs}[g^*(A,X) - h_1(Y,A,Z)| A,X,Z,R_Z=1] = 0\ \ \mathrm{a.s.},$$
where $g^*$ is the exact CATE.
Proof. (SUFFICIENCY). Let $\tilde{g}=g^{* }$ and $\tilde{h}_ 1$ be a solution of the equation above. It suffices to show that $(\tilde{h}_ 1,\tilde{g})$ satisfies the integral equation system in Theorem 1, hence is a solution. The second equation of the IES holds directly, since by $R_ X\perp (Y,R_ Z)\ \vert\ Z,A,X$ the additional conditioning on $R_ X=1$ does not change the conditional expectation. For the first equation,
$$\begin{align*} &\mathbb{E}_\mathrm{obs}[\tilde{h}_1(Y,A,Z)|Z,R_Z=1]\\ &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[\tilde{h}_1(Y,A,Z)|Z,A,X,(R_Z,R_X)=\mathbf{1}]|Z,R_Z=1\right]\\ &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[g^*(A,X)|Z,A,X,(R_Z,R_X)=\mathbf{1}]|Z,R_Z=1\right]\\ &= \mathbb{E}_\mathrm{obs}\left[g^*(A,X)|Z,R_Z=1\right] \\ &= \mathbb{E}_\mathrm{obs}\left[f(A,X) + \mathbb{E}_\mathrm{obs}[\epsilon|X]|Z,R_Z=1\right]\\ &= \mathbb{E}_\mathrm{obs}\left[Y|Z,R_Z=1\right]. \end{align*}$$
The first two equalities hold by $R_ X\perp (Y,R_ Z) \vert Z,A,X$ together with the assumed equation, and the last two follow from $g^{* }(A,X)=f(A,X)+\mathbb{E}_ \mathrm{obs}[\epsilon\vert X]$ and the computation of $\mathbb{E}_ \mathrm{obs}[Y\vert Z,R_ Z=1]$ in the proof of Theorem 1.
(NECESSITY). It suffices to show that any solution of the IES in Theorem 1 satisfies the equation in this claim. If $h=(h_ 1,g)$ is a solution of the IES, then $g\overset{\mathrm{a.s.}}{=} g^{* }$ and
$$\begin{align*} &\mathbb{E}_\mathrm{obs}[g^*(A,X) - h_1(Y,A,Z)|A,X,Z,R_Z = 1]\\ &=\mathbb{E}_\mathrm{obs}[g^*(A,X) - h_1(Y,A,Z)|A,X,Z,(R_Z,R_X) = \mathbf{1}] \overset{\mathrm{a.s.}}{=} 0, \end{align*}$$
where the first equality holds by $R_ X\perp (Y,R_ Z) \vert Z,A,X.$
This finishes the proof.
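In general the IES above is solved with the minimax estimation procedure described later. As an illustrative sanity check only: in a fully linear, fully observed special case ($Y=\beta A+\gamma X+\epsilon$ with $\epsilon$ correlated with $A$ through a hidden $U,$ $\mathbb{E}[\epsilon\vert X=x]=\delta x,$ and a linear instrument $Z$), the CATE is affine, $g^{* }(a,x)=\beta a+(\gamma+\delta)x,$ and it can be recovered by classical two-stage least squares with $X$ included in both stages. The sketch below simulates that toy specification; it is not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# linear CCB-IV toy model: U confounds A and Y, Z is a valid instrument given X
x = rng.normal(size=n)
u = rng.normal(size=n)
z = rng.normal(size=n) + 0.5 * x
a = 1.0 * z + 1.0 * u + 0.5 * x + 0.1 * rng.normal(size=n)
eps = 1.0 * u + 0.3 * x            # so E[eps | X] = 0.3 X
y = 2.0 * a + 1.0 * x + eps        # f(a, x) = 2a + x, hence g*(a, x) = 2a + 1.3x

# stage 1: project A onto the instruments (Z, X, 1)
W = np.column_stack([z, x, np.ones(n)])
a_hat = W @ np.linalg.lstsq(W, a, rcond=None)[0]

# stage 2: regress Y on (A_hat, X, 1); coefficients estimate (2, 1.3, 0)
V = np.column_stack([a_hat, x, np.ones(n)])
coef = np.linalg.lstsq(V, y, rcond=None)[0]
print(coef.round(2))   # approximately [2.0, 1.3, 0.0]
```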
CCB with Proximal Variable (CCB-PV)
The model for CCB-PV is illustrated in the following figure.

Assumption 2. We assume that the following conditions hold for the observational process of the CCB-PV, where $Z\in\mathcal{Z}$ and $W\in\mathcal{W}$ denote the proximal variables and $R_ Z,R_ W$ their missing indicators.
- (PV completeness). For any $a\in\mathcal{A},x\in\mathcal{X},$ $\mathbb{E}_ \mathrm{obs}[\sigma(U)\vert X=x,A=a,Z=z,R_ Z=1] = 0$ holds for any $z\in\mathcal{Z}$ if and only if $\sigma(U)=0$ almost surely.
- (PV independence). $W\perp A\vert (U,X)$ and $Z\perp (Y,W)\vert (A,U,X).$
- (Unconfounded and outcome-independent missingness). We assume that $R_W$ is caused by $(W, X, A),$ $R_X$ is caused by $X,$ and $R_Z$ is caused by $(Z, U, A, X).$
Theorem 2. Suppose that Assumption 2 holds. If there exist bridge functions $h_1:\mathcal{Y}\times\mathcal{A}\times\mathcal{X}\times\mathcal{Z}\to\mathbb{R},$ $h_2:\mathcal{A}\times\mathcal{W}\times\mathcal{X}\to\mathbb{R},$ $h_3:\mathcal{Y}\times\mathcal{A}\times\mathcal{X}\times\mathcal{A}\to\mathbb{R}$ and $g:\mathcal{A}\times\mathcal{X}\to\mathbb{R}$ satisfying:
$$\begin{align*} &\mathbb{E}_\mathrm{obs}[h_1(Y,A,X,Z) - Y|A,X,Z,(R_X,R_Z)=\mathbf{1}] = 0 \ \ \mathrm{a.s.},\\ &\mathbb{E}_\mathrm{obs}[h_2(A,W,X) - h_1(Y,A,X,Z)|A,W,X,Z,(R_W,R_X,R_Z)=\mathbf{1}] = 0\ \ \mathrm{a.s.},\\ &\mathbb{E}_\mathrm{obs}[h_3(Y,A,X,a') - h_2(a',W,X)|A,W,X,(R_W,R_X)=\mathbf{1}] = 0\ \ \mathrm{a.s.},\\ &\mathbb{E}_\mathrm{obs}[g(a',X) - h_3(Y,A,X,a')|X,R_X=1] = 0\ \ \mathrm{a.s.} \end{align*}$$
for any $a'\in\mathcal{A},$ then it follows that $g(A,X)\overset{\mathrm{a.s.}}{=} g^{* }(A,X)$ where $g^{* }$ is the CATE.
Proof. We first compute $\mathbb{E}_ \mathrm{obs}[Y\vert A,X,Z,(R_X,R_Z)=\mathbf{1}]:$
$$\begin{align*} &\mathbb{E}_\mathrm{obs}[Y\vert A,X,Z,(R_X,R_Z)=\mathbf{1}]\\ &= \mathbb{E}_\mathrm{obs}[h_1(Y,A,X,Z)\vert A,X,Z,(R_X,R_Z)=\mathbf{1}]\\ &= \mathbb{E}_\mathrm{obs}[h_2(A,W,X)\vert A,X,Z,(R_X,R_Z)=\mathbf{1}]. \end{align*}$$
The second equality is implied by $R_ W\perp (Y,R_ X,R_ Z)\ \vert\ A,X,W.$ Furthermore,
$$\begin{align*} 0 &= \mathbb{E}_\mathrm{obs}[Y - h_2(A,W,X)\vert A,X,Z,(R_X,R_Z)=\mathbf{1}]\\ &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[Y - h_2(A,W,X)\vert A,U,X,Z,(R_X,R_Z)=\mathbf{1}]|A,X,Z,(R_X,R_Z)=\mathbf{1}\right]\\ &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[Y - h_2(A,W,X)\vert A,U,X]|A,X,Z,R_Z=1\right]. \end{align*}$$
The third equality holds by $W\perp Z\ \vert\ U,X.$ Then $\mathbb{E}_\mathrm{obs}[Y - h_2(A,W,X)\vert A,U,X] \overset{\mathrm{a.s.}}{=} 0$ follows from PV completeness.
Now we compute $g(a',X)$ for an arbitrary $a'\in\mathcal{A}:$
$$\begin{align*} g(a',X) &= \mathbb{E}_\mathrm{obs}[g(a',X)| X,R_X=1] = \mathbb{E}_\mathrm{obs}[h_3(Y,A,X,a')| X,R_X=1]\\ &\overset{(1)}{=} \mathbb{E}_\mathrm{obs}[h_2(a',W,X)| X,R_X=1] = \mathbb{E}_\mathrm{obs}[h_2(a',W,X)| X]\\ &\overset{(2)}{=} \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[h_2(a',W,X)|X,A=a',U]| X\right]. \end{align*}$$
Here $(1)$ holds by $R_W\perp (R_X,Y)\ \vert A,W,X,$ and $(2)$ holds by $A\perp W\ \vert\ U,X.$
Additionally,
$$\begin{align*} g^*(a,x) &= \mathbb{E}_\mathrm{obs}[Y|X=x,\mathrm{do}(A=a)]\\ &= \mathbb{E}_\mathrm{obs}\left[\mathbb{E}_\mathrm{obs}[Y|X=x,A=a,U]|X=x\right]. \end{align*}$$
Combining the last display with $\mathbb{E}_ \mathrm{obs}[Y - h_2(A,W,X)\vert A,U,X] \overset{\mathrm{a.s.}}{=} 0,$ the inner conditional expectations coincide, and we conclude $g(A,X)\overset{\mathrm{a.s.}}{=}g^*(A,X).$
Claim 2. A solution $h= (h_ 1, h_ 2, h_ 3, g)$ to the IES in Theorem 2 exists if and only if the following three conditions are satisfied:
- There exists a solution $h_2$ to $\mathbb{E}_\mathrm{obs}[h_2(A,W,X) - Y\vert A,X,U] \overset{\mathrm{a.s.}}{=} 0.$
- For any solution $h_2,$ there exists a solution $h_1$ to the second equation of the IES in Theorem 2.
- For any solution $h_2,$ there exists a solution $h_3$ to the fourth equation of the IES in Theorem 2.
Remark. The IES in Theorem 2 holds pointwise for all $a'\in\mathcal{A}.$ We can introduce a pseudo random variable $A'$ which is independent of the CCB-PV model and uniformly distributed on $\mathcal{A},$ with $a'$ as its realization. Then, the last two equations in Theorem 2 can be transformed into a conditional moment form.
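Below is a minimal sketch of this augmentation (the record layout and helper name are illustrative, not from the paper): each collected row is paired with an independently drawn uniform pseudo action $a',$ after which $a'$ can be conditioned on like any other observed variable in the last two equations.

```python
import random

def augment_with_pseudo_action(rows, actions, seed=0):
    """Attach an independent, uniformly drawn pseudo action a' to every record,
    so the pointwise-in-a' equations become conditional moment restrictions."""
    rng = random.Random(seed)
    return [dict(row, a_prime=rng.choice(actions)) for row in rows]

rows = [{"x": 0.3, "a": 1, "y": 2.0, "w": 0.7}, {"x": -0.1, "a": 0, "y": 1.1, "w": 0.2}]
print(augment_with_pseudo_action(rows, actions=[0, 1]))
```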
Unified Form for the IES
The IES discussed in both the CCB-IV and CCB-PV cases can be summarized in a unified form:
$$\mathbb{E}_{\mu_k}[\alpha_k(h(X),Y_k)|Z_k]=0,\ k=1,\cdots,K,$$
where
- $K$ denotes the number of equations in the IES,
- $h$ denotes the vector of bridge functions to learn, i.e. $h(X)=\left(h_1(X_1),\cdots,h_{K-1}(X_{K-1}),g(X_K)\right),$ where $X_k$ is the random variable vector that $h_k$ depends on and $X$ is the union of $\lbrace X_k\rbrace_ {k=1}^K,$
- $\alpha_k$ is the linear function inside the expectation in the $k^\mathrm{th}$ equation of the IES, which depends on $h(X)$ and the random variable vector $Y_k,$
- $Z_k$ is the random variable vector that is conditioned on in the $k^\mathrm{th}$ equation of the IES, and
- $\mu_k$ denotes the joint distribution for $(X, Y_k, Z_k).$
A linear operator $\mathcal{T}_k:\mathcal{H}\to\mathcal{F}(Z_k)$ can be defined as
$$\mathcal{T}_kh(\cdot):=\mathbb{E}_{\mu_k}[\alpha_k(h(X),Y_k)|Z_k=\cdot].$$
By letting $\mathcal{T}h(z) = \left(\mathcal{T}_ 1h(z_ 1),\cdots,\mathcal{T}_ Kh(z_ K)\right),$ the IES can be alternatively expressed as
$$\mathcal{T}h(z)=0.$$
Estimation
Let $\mathcal{H}=\mathcal{H}_ 1\times\cdots\times\mathcal{H}_ {K-1}\times\mathcal{G}$ be the hypothesis class for $h,$ and $\mu=(\mu_ 1,\cdots,\mu_ K).$ Then solving the conditional moment equations is equivalent to minimizing the squared RMSE within $\mathcal{H}:$
$$\Vert\mathcal{T}h\Vert_{\mu,2}^2 = \sum_{k=1}^K\mathbb{E}_{\mu_k}\left[\left(\mathcal{T}_kh(Z_k)\right)^2\right] = \sum_{k=1}^K\mathbb{E}_{\mu_k}\left[(\mathbb{E}_{\mu_k}[\alpha_k(h(X),Y_k)|Z_k])^2\right].$$
We introduce a test function $\theta_ k\in\Theta_ k$ for each linear operator $\mathcal{T}_ k.$ Applying Fenchel duality, we have
$$\frac{1}{2}\left(\mathcal{T}_kh(Z_k)\right)^2 = \sup_{\theta_k\in\Theta_k}\left\lbrace\mathcal{T}_kh(Z_k)\theta_k(Z_k) - \frac{1}{2}\theta_k(Z_k)^2\right\rbrace.$$
Then the squared RMSE can be replaced equivalently by an unconditional moment loss function
$$\mathcal{L}(h) = \sum_{k=1}^K\mathcal{L}_k(h),\ \ \mathcal{L}_k(h) = \sup_{\theta_k\in\Theta_k}\mathbb{E}_{\mu_k}[\alpha_k(h(X),Y_k)\theta_k(Z_k)] - \frac{1}{2}\Vert\theta_k\Vert_{\mu_k,2}^2.$$
Assume that each $\Theta_ k$ is a star-shaped class and that $\mathcal{T}_ k h\in\Theta_ k$ for all $h\in\mathcal{H}.$ Then $\mathcal{L}_ k(h)=\frac{1}{2}\Vert\mathcal{T}_ kh\Vert_ {\mu_ k,2}^2,$ so any solution of $\mathcal{T}h(z)=0$ is a minimizer of the unconditional moment loss function. For estimation, we replace the theoretical distribution $\mu$ with the empirical distribution of the dataset $\mathcal{D}.$ Letting $\mathcal{D}_ k$ denote the subset of $\mathcal{D}$ whose samples are drawn from $\mu_ k,$ the empirical loss is
$$\mathcal{L}_\mathcal{D}(h) = \sum_{k=1}^K\mathcal{L}_{k,\mathcal{D}}(h),\ \ \mathcal{L}_{k,\mathcal{D}}(h) = \sup_{\theta_k\in\Theta_k}\mathbb{E}_{\mathcal{D}_k}[\alpha_k(h(X),Y_k)\theta_k(Z_k)] - \frac{1}{2}\Vert\theta_k\Vert_{\mathcal{D}_k,2}^2.$$
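For a finite-dimensional linear test class $\theta_ k(z)=\phi_ k(z)^\top w,$ the inner supremum has a closed form, since $\sup_ w\ b^\top w-\tfrac{1}{2}w^\top Mw=\tfrac{1}{2}b^\top M^{-1}b$ with $b=\mathbb{E}_ {\mathcal{D}_ k}[\alpha_ k\,\phi_ k(Z_ k)]$ and $M=\mathbb{E}_ {\mathcal{D}_ k}[\phi_ k(Z_ k)\phi_ k(Z_ k)^\top].$ The sketch below assumes such a linear class and adds a small ridge term for numerical stability (an assumption of this sketch, not part of the loss above).

```python
import numpy as np

def empirical_minimax_loss(alpha_vals, phi_z, ridge=1e-6):
    """L_{k,D}(h) for one equation of the IES with a linear test class.

    alpha_vals : (n,) array, alpha_k(h(x_i), y_{k,i}) evaluated on D_k
    phi_z      : (n, d) array, test-function features phi_k(z_{k,i})
    """
    n, d = phi_z.shape
    b = phi_z.T @ alpha_vals / n                    # E_D[alpha_k * phi_k(Z_k)]
    M = phi_z.T @ phi_z / n + ridge * np.eye(d)     # E_D[phi_k(Z_k) phi_k(Z_k)^T]
    w = np.linalg.solve(M, b)                       # maximizing test-function weights
    return 0.5 * b @ w                              # sup_w b^T w - 0.5 w^T M w

# toy usage: residuals unrelated to Z_k give a loss near 0, Z_k-dependent residuals do not
rng = np.random.default_rng(2)
z = rng.normal(size=500)
phi = np.column_stack([np.ones_like(z), z])
print(empirical_minimax_loss(rng.normal(size=500), phi))      # small: pure noise residuals
print(empirical_minimax_loss(z + rng.normal(size=500), phi))  # larger: residuals depend on Z
```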
Then the confidence set $\mathrm{CI}_ \mathcal{D}(e_ \mathcal{D}),$ the level set with respect to the metric $\mathcal{L}_ \mathcal{D}(\cdot)$ and threshold $e_ \mathcal{D},$ can be constructed:
$$\begin{align*} \mathrm{CI}_{\mathcal{H},\mathcal{D}}(e_\mathcal{D}) &= \left\lbrace h\in\mathcal{H}: \mathcal{L}_\mathcal{D}(h) \leq \inf_{h'\in\mathcal{H}}\mathcal{L}_\mathcal{D}(h') + e_\mathcal{D}\right\rbrace,\\ \mathrm{CI}_\mathcal{D}(e_\mathcal{D}) &= \left\lbrace g\in\mathcal{G}: \exists h\in\mathrm{CI}_{\mathcal{H},\mathcal{D}}(e_\mathcal{D})\ \text{s.t.}\ g=h^{(K)}\right\rbrace, \end{align*}$$
where $h^{(K)}$ denotes the last ($\mathcal{G}$-valued) component of $h.$
Furthermore, the target policy is learned by optimizing the pessimistic average reward function:
$$\widehat{\pi} = \underset{\pi\in\Pi}{\mathrm{argsup}}\inf_{g\in\mathrm{CI}_\mathcal{D}(e_\mathcal{D})} v(g,\pi).$$
The estimated pessimistic CATE under the interventional policy $\pi$ is $\widehat{g}^\pi=\mathrm{arginf}_ {g\in\mathrm{CI}_ \mathcal{D}(e_ \mathcal{D})}v(g,\pi).$
References
- Siyu Chen, Yitan Wang, Zhaoran Wang, and Zhuoran Yang. 2023. A Unified Framework of Policy Learning for Contextual Bandit with Confounding Bias and Missing Observations. arXiv preprint arXiv:2303.11187.