Combining Experimental and Observational Data (II)

In this post, we continue discussing how to combine experimental and observational data. Previous post: Combining Experimental and Observational Data (I).

Observational Data with Treatment and Outcome

In observational data, the data-generating process is unknown and unmeasured confounding is generally present, i.e.,

\[\lbrace Y(1),Y(0)\rbrace\not\perp A\ |\ X.\]

From $m$ observational samples, standard causal inference methods yield an estimator $\hat{\tau}^{\mathcal{O}}_m$ of the ATE $\tau$ and an estimator $\hat{\tau}^{\mathcal{O}}_m(x)$ of the CATE $\tau(x).$ Because of unmeasured confounding, both are generally biased, even asymptotically:

$$\lim_{m\to\infty}\hat{\tau}^\mathcal{O}_m \neq \tau,\ \text{and}\ \lim_{m\to\infty}\hat{\tau}^\mathcal{O}_m(x) \neq \tau(x).$$

Deconfounding: Model the bias

This idea comes from Kallus et al. (2018). Define a bias function $\eta(x)\neq 0$ that captures the discrepancy between the true CATE and its confounded observational counterpart:

$$\eta(x) := \tau(x) - \tau^\mathcal{O}(x).$$

Here $\tau^{\mathcal{O}}(x) = \mathbb{E}[Y\vert X=x,A=1] - \mathbb{E}[Y\vert X=x,A=0].$ Since $\tau^{\mathcal{O}}(x)$ is identifiable from the observational data, $\hat{\tau}^\mathcal{O}_m(x)$ is obtained as its plug-in estimator. Suppose $\eta(x)$ can be well approximated by a low-complexity function. For instance, we can model the bias linearly:

$$\tau(x) = \tau^\mathcal{O}(x) + \theta^\top x,\ \theta\in\mathbb{R}^p.$$

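Next, on the RCT sample, we use inverse propensity weighting to construct a transformed outcome $\tau^*_i$ that serves as an unbiased signal for $\tau(X_i),$ where $e(\cdot)$ denotes the treatment probability in the trial: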

$$\tau^*_i = \left(\frac{A_i}{e(X_i)} - \frac{1-A_i}{1-e(X_i)}\right)Y_i.$$
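To see why $\tau^*_i$ is a reasonable regression target, one can check its conditional mean on the RCT sample. The following is a short derivation sketch, assuming consistency ($Y = AY(1) + (1-A)Y(0)$), randomization ($\lbrace Y(1),Y(0)\rbrace\perp A\ \vert\ X$ in the trial), and $0<e(x)<1$ with $e(x)=\mathbb{P}(A=1\vert X=x)$:

$$\begin{align} \mathbb{E}[\tau^*_i|X_i=x] &= \frac{\mathbb{E}[A_iY_i(1)|X_i=x]}{e(x)} - \frac{\mathbb{E}[(1-A_i)Y_i(0)|X_i=x]}{1-e(x)}\\ &= \frac{e(x)\,\mathbb{E}[Y_i(1)|X_i=x]}{e(x)} - \frac{(1-e(x))\,\mathbb{E}[Y_i(0)|X_i=x]}{1-e(x)}\\ &= \tau(x). \end{align}$$

So $\tau^*_i$ is a noisy but conditionally unbiased proxy for the CATE at $X_i.$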

Then we can learn $\theta$ through a least-squares approach on the RCT sample:

$$\hat{\theta}_{n,m} = \underset{\theta\in\mathbb{R}^p}{\mathrm{argmin}}\sum_{i=1}^n\left(\tau_i^* - \hat{\tau}_m^\mathcal{O}(X_i) - \theta^\top X_i\right)^2.$$

Note that $\hat{\tau}_m^\mathcal{O}(\cdot)$ is learned on the observational data, while the sum runs over the $n$ RCT samples. Using the estimated $\hat{\theta}_{n,m},$ we can recover the causal effect:

$$\hat{\tau}_{n,m}(x) = \hat{\tau}_m^\mathcal{O}(x) + \hat{\theta}_{n,m}^\top x.$$

Under some regularity conditions, the least-squares estimator $\hat{\theta}_{n,m}$ is consistent, and $\hat{\tau}_{n,m}(\cdot)$ is consistent on its target population.
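The procedure above can be sketched in a few lines of NumPy. This is only an illustration on simulated data, not the implementation of Kallus et al. (2018): the data-generating process, the function names (`simulate`, `fit_obs_cate`, `design`), the linear T-learner used for $\hat{\tau}_m^\mathcal{O},$ and the known trial propensity $e(x)=0.5$ are all assumptions made for the example. An intercept is added to the features so that $\theta^\top x$ can absorb a constant bias.

```python
import numpy as np

rng = np.random.default_rng(0)

def design(X):
    """Prepend an intercept column so theta^T x can capture a constant bias."""
    return np.column_stack([np.ones(len(X)), X])

def true_cate(X):
    # Hypothetical ground-truth CATE, linear in x.
    return 1.0 + 0.5 * X[:, 0] - 0.5 * X[:, 1]

def simulate(n, randomized):
    X = rng.normal(size=(n, 2))
    U = rng.normal(size=n)                            # unmeasured confounder
    if randomized:
        A = rng.binomial(1, 0.5, size=n)              # trial: e(x) = 0.5
    else:
        A = (U + rng.normal(size=n) > 0).astype(int)  # U drives treatment
    Y = (true_cate(X) * A + 0.5 * X.sum(axis=1)
         + (1.0 + 0.5 * X[:, 0]) * U + 0.5 * rng.normal(size=n))
    return X, A, Y

def fit_obs_cate(X, A, Y):
    """Confounded CATE estimate: one linear regression per treatment arm."""
    b1 = np.linalg.lstsq(design(X[A == 1]), Y[A == 1], rcond=None)[0]
    b0 = np.linalg.lstsq(design(X[A == 0]), Y[A == 0], rcond=None)[0]
    return lambda X_new: design(X_new) @ (b1 - b0)

# Step 1: learn the (biased) CATE on the observational sample.
X_o, A_o, Y_o = simulate(50_000, randomized=False)
tau_obs = fit_obs_cate(X_o, A_o, Y_o)

# Step 2: build the transformed outcome on the RCT sample.
X_r, A_r, Y_r = simulate(20_000, randomized=True)
e = 0.5
tau_star = (A_r / e - (1 - A_r) / (1 - e)) * Y_r

# Step 3: least squares for theta on the RCT residuals.
theta_hat = np.linalg.lstsq(design(X_r), tau_star - tau_obs(X_r), rcond=None)[0]

def tau_hat(X_new):
    """Bias-corrected CATE: observational estimate plus linear correction."""
    return tau_obs(X_new) + design(X_new) @ theta_hat

X_test = rng.normal(size=(2_000, 2))
rmse_obs = np.sqrt(np.mean((tau_obs(X_test) - true_cate(X_test)) ** 2))
rmse_corrected = np.sqrt(np.mean((tau_hat(X_test) - true_cate(X_test)) ** 2))
print(f"RMSE, observational only: {rmse_obs:.3f}")
print(f"RMSE, after correction:   {rmse_corrected:.3f}")
```

In this simulation the confounding bias happens to be exactly linear in $(1, x),$ so the linear correction is well specified; with a misspecified bias model the correction would only be approximate.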

Deconfounding: Model the confounding

This idea comes from Yang et al. (2020). Unlike the previous section, we now use $S=1$ to denote membership in the RCT sample and $S=0$ to denote membership in the observational sample.

Assumption 1 (Transportability and strong ignorability of trial treatment assignment).

$\mathbb{E}[Y(1)-Y(0)\vert X,S=1] = \mathbb{E}[Y(1) - Y(0)\vert X,S=0] = \tau(X),$ and
$Y(a)\perp A\ \vert\ (X,S=1)\ \text{for}\ a\in\lbrace 0,1\rbrace,$ and $e(X,S)>0$ almost surely, where $e(X,S)=\mathbb{P}(A=1\vert X,S)$ is the propensity score.

We define the CATE $\tau(\cdot)$ and the confounding function $\lambda(\cdot)$ by

$$\begin{align} \tau(x) &= \mathbb{E}[Y(1)-Y(0)|X=x],\\ \lambda(x) &= \mathbb{E}[Y(0)|A=1,X=x,S=0] - \mathbb{E}[Y(0)|A=0,X=x,S=0]. \end{align}$$

Under Assumption 1, both are identifiable from the observed data:

$$\begin{align} \tau(x) &= \mathbb{E}[Y|A=1,X=x,S=1] - \mathbb{E}[Y|A=0,X=x,S=1],\\ \lambda(x) &= \mathbb{E}[Y|A=1,X=x,S=0] - \mathbb{E}[Y|A=0,X=x,S=0] - \tau(x). \end{align}$$

We parameterize $\tau(\cdot)$ and $\lambda(\cdot)$ as follows:

Assumption 2 (Parametric structural model). The treatment effect function and the confounding function are

$$\tau(x)=\tau_{\varphi_0}(x),\ \lambda(x)=\lambda_{\phi_0}(x),$$

where $\psi_0 = (\varphi_0^\top,\phi_0^\top)^\top\in\mathbb{R}^{p_1}\times\mathbb{R}^{p_2}.$

Remark. A crude estimator of $\psi_0$ can be obtained by least squares, since $\tau_{\varphi_0}(x)$ and $\lambda_{\phi_0}(x)$ are identified.
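One possible reading of this remark, sketched below on simulated data, is to plug least-squares fits into the identification formulas: the trial arm contrast gives $\varphi,$ and the observational arm contrast minus the trial contrast gives $\phi.$ The linear forms $\tau_\varphi(x)=\varphi^\top(1,x)$ and $\lambda_\phi(x)=\phi^\top(1,x),$ the simulation design, and the helper names (`basis`, `simulate`, `arm_contrast`) are assumptions for illustration and are not taken from Yang et al. (2020).

```python
import numpy as np

rng = np.random.default_rng(1)

def basis(x):
    """Linear basis (1, x) used for both tau and lambda in this sketch."""
    return np.column_stack([np.ones_like(x), x])

phi_true = np.array([1.0, 0.5])   # hypothetical tau(x) = 1.0 + 0.5 x

def simulate(n, s):
    x = rng.normal(size=n)
    u = rng.normal(size=n)                            # unmeasured confounder
    if s == 1:
        a = rng.binomial(1, 0.5, size=n)              # randomized trial
    else:
        a = (u + rng.normal(size=n) > 0).astype(int)  # confounded assignment
    y = (basis(x) @ phi_true * a + x
         + (1.0 + 0.5 * x) * u + 0.5 * rng.normal(size=n))
    return x, a, y

def arm_contrast(x, a, y):
    """Coefficients of E[Y|A=1,X=x] - E[Y|A=0,X=x] from per-arm linear fits."""
    b1 = np.linalg.lstsq(basis(x[a == 1]), y[a == 1], rcond=None)[0]
    b0 = np.linalg.lstsq(basis(x[a == 0]), y[a == 0], rcond=None)[0]
    return b1 - b0

# tau is identified by the arm contrast in the trial (S = 1).
phi_hat = arm_contrast(*simulate(20_000, s=1))
# lambda is the observational arm contrast (S = 0) minus tau.
lam_hat = arm_contrast(*simulate(50_000, s=0)) - phi_hat

psi_hat = np.concatenate([phi_hat, lam_hat])
print("tau coefficients:   ", phi_hat)
print("lambda coefficients:", lam_hat)
```

Such a two-step estimator is simple but generally not efficient, which motivates the construction that follows.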

Note that

$$\begin{align} \mathbb{E}[Y-\tau(X)A|A,X,S] &= \mathbb{E}[Y(A)-A(Y(1)-Y(0))|A,X,S] = \mathbb{E}[Y(0)|A,X,S], \end{align}$$

that by Assumption 1

$$\mathbb{E}[Y(0)|A,X,S=1] - \mathbb{E}[Y(0)|X,S=1] = 0,$$

and that

$$\begin{align} &\mathbb{E}[Y(0)|A,X,S=0] - \mathbb{E}[Y(0)|X,S=0] \\ &= \mathbb{E}[Y(0)|A,X,S=0] - \mathbb{E}[Y(0)|A=1,X,S=0]\times e(X,0)\\ &\quad - \mathbb{E}[Y(0)|A=0,X,S=0]\times(1-e(X,0))\\ &= \left(A-e(X,0)\right)\left\lbrace\mathbb{E}[Y(0)|A=1,X,S=0]-\mathbb{E}[Y(0)|A=0,X,S=0]\right\rbrace\\ &= \lambda_{\phi_0}(X)\left(A-e(X,0)\right). \end{align}$$

Then, we introduce the following variable to mimic $Y(0):$

$$H_{\psi_0} = Y-\tau_{\varphi_0}(X)A - (1-S)\lambda_{\phi_0}(X)(A-e(X,S)).$$

Proposition 1. Under Assumptions 1 and 2, we have $\mathbb{E}[H_{\psi_0}\vert A,X,S] = \mathbb{E}[Y(0)\vert X,S].$
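The proposition follows by combining the three displays above. As a brief sketch, note that the correction term in $H_{\psi_0}$ is only active on the observational sample ($S=0$), where the relevant propensity score is $e(X,0)$:

$$\begin{align} \mathbb{E}[H_{\psi_0}|A,X,S] &= \mathbb{E}[Y-\tau_{\varphi_0}(X)A|A,X,S] - (1-S)\lambda_{\phi_0}(X)\left(A-e(X,0)\right)\\ &= \mathbb{E}[Y(0)|A,X,S] - (1-S)\lambda_{\phi_0}(X)\left(A-e(X,0)\right). \end{align}$$

For $S=1,$ the last term vanishes and $\mathbb{E}[Y(0)\vert A,X,S=1] = \mathbb{E}[Y(0)\vert X,S=1]$ by Assumption 1. For $S=0,$ the last term equals $\mathbb{E}[Y(0)\vert A,X,S=0] - \mathbb{E}[Y(0)\vert X,S=0],$ so the difference is $\mathbb{E}[Y(0)\vert X,S=0].$ In both cases the result no longer depends on $A.$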

Furthermore, the semiparametric efficient score for $\psi_0$ can be derived, and solving the corresponding estimating equation yields a semiparametrically efficient estimator of $\psi_0.$

References

Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., Vert, J.P., Josse, J. and Yang, S. (2020). Causal inference methods for combining randomized trials and observational studies: a review. arXiv preprint arXiv:2011.08047.
Kallus, N., Puli, A. M., and Shalit, U. (2018). Removing hidden confounding by experimental grounding. In Advances in Neural Information Processing Systems, pages 10888–10897.
Yang, S., Zeng, D., and Wang, X. (2020). Improved inference for heterogeneous treatment effects using real-world data subject to hidden confounding. arXiv preprint arXiv:2007.12922.