This post continues the discussion of combining experimental and observational data. Previous post: Combining Experimental and Observational Data (I).

Observational Data with Treatment and Outcome

In observational data, the true data-generating process is unknown, and in general there is unmeasured confounding, i.e.

\[\lbrace Y(1),Y(0)\rbrace\not\perp A\ |\ X.\]

From observational data alone, we can obtain the standard causal inference estimator $\hat{\tau}^{\mathcal{O}}_m$ of the ATE $\tau,$ and $\hat{\tau}^{\mathcal{O}}_m(x)$ of the CATE $\tau(x),$ but both carry confounding bias that does not vanish as the sample grows:

$$\lim_{m\to\infty}\hat{\tau}^\mathcal{O}_m \neq \tau,\ \text{and}\ \lim_{m\to\infty}\hat{\tau}^\mathcal{O}_m(x) \neq \tau(x).$$

Deconfounding: Model the bias

This idea comes from Kallus et al. (2018). Define a bias function $\eta(x)\neq 0$ to model the discrepancy between the true CATE and the confounded contrast identified from observational data:

$$\eta(x) := \tau(x) - \tau^\mathcal{O}(x).$$

Here $\tau^{\mathcal{O}}(x) = \mathbb{E}[Y\vert X=x,A=1] - \mathbb{E}[Y\vert X=x,A=0].$ Since $\tau^{\mathcal{O}}(x)$ is identifiable from the observational data, our estimator $\hat{\tau}^\mathcal{O}_m(x)$ is a plug-in estimate of it. Suppose $\eta(x)$ can be well approximated by a function of low complexity. We use a linear model for the bias, $\eta(x)=\theta^\top x$:

$$\tau(x) = \tau^\mathcal{O}(x) + \theta^\top x,\ \theta\in\mathbb{R}^p.$$

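Now, on the RCT sample, we use a reweighting approach to construct a pseudo-outcome whose conditional mean given $X_i$ is $\tau(X_i),$ where $e(X_i)$ is the propensity score in the RCT: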

$$\tau^*_i = \left(\frac{A_i}{e(X_i)} - \frac{1-A_i}{1-e(X_i)}\right)Y_i.$$
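To see why $\tau_i^*$ can stand in for the unobserved individual effect, here is a short check. It assumes (these are not spelled out above) that $e(x)=\mathbb{P}(A=1\vert X=x)$ is the treatment probability in the RCT, that $Y=AY(1)+(1-A)Y(0),$ and that treatment is independent of the potential outcomes given $X$ within the trial:

$$\begin{align} \mathbb{E}[\tau_i^*\vert X_i=x] &= \frac{\mathbb{E}[A_iY_i(1)\vert X_i=x]}{e(x)} - \frac{\mathbb{E}[(1-A_i)Y_i(0)\vert X_i=x]}{1-e(x)}\\ &= \frac{e(x)\,\mathbb{E}[Y_i(1)\vert X_i=x]}{e(x)} - \frac{(1-e(x))\,\mathbb{E}[Y_i(0)\vert X_i=x]}{1-e(x)}\\ &= \tau(x). \end{align}$$

So $\tau_i^*$ is a noisy but conditionally unbiased signal for $\tau(X_i),$ which is what makes it usable as a regression target.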

Then we can learn $\theta$ through a least-squares approach on the RCT sample:

$$\hat{\theta} = \underset{\theta\in\mathbb{R}^p}{\mathrm{argmin}}\sum_{i=1}^n\left(\tau_i^* - \hat{\tau}_m^\mathcal{O}(X_i) - \theta^\top X_i\right)^2$$

Note that $\hat{\tau}_m^\mathcal{O}(\cdot)$ is learned on the observational data, so the resulting estimate depends on both sample sizes; we write it as $\hat{\theta}_{n,m}.$ Using $\hat{\theta}_{n,m},$ we can recover the causal effect:

$$\hat{\tau}_{n,m}(x) = \hat{\tau}_m^\mathcal{O}(x) + \hat{\theta}_{n,m}^\top x.$$

Under some regularity conditions, the least-squares estimate $\hat{\theta}_{n,m}$ is consistent, and $\hat{\tau}_{n,m}(\cdot)$ is consistent on its target population.
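The two-step procedure can be sketched on simulated data. Everything in the simulation is an assumption made for illustration, not taken from Kallus et al. (2018): the data-generating process, the quadratic `true_cate`, the per-arm polynomial regression used for $\hat{\tau}^{\mathcal{O}}_m,$ the constant RCT propensity of 0.5, and the inclusion of an intercept in the features of the linear correction.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_cate(x):
    # Hypothetical ground truth, known only because we simulate.
    return 1.0 + 2.0 * x**2

def simulate(n, randomized):
    x = rng.uniform(-1.0, 1.0, size=n)
    u = rng.normal(size=n)  # unmeasured confounder
    if randomized:
        a = rng.binomial(1, 0.5, size=n)
    else:
        a = (u + rng.normal(size=n) > 0).astype(int)  # treatment depends on U
    y = a * true_cate(x) + x + u * (1.0 + x) + rng.normal(scale=0.5, size=n)
    return x, a, y

def poly_features(x):
    return np.column_stack([np.ones_like(x), x, x**2])

def linear_features(x):
    # Assumption: "x" in theta^T x includes a constant feature.
    return np.column_stack([np.ones_like(x), x])

def ols(design, target):
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# Step 1: confounded contrast tau^O, fitted per arm on observational data.
x_o, a_o, y_o = simulate(50_000, randomized=False)
beta1 = ols(poly_features(x_o[a_o == 1]), y_o[a_o == 1])
beta0 = ols(poly_features(x_o[a_o == 0]), y_o[a_o == 0])

def tau_obs_hat(x):
    return poly_features(x) @ (beta1 - beta0)

# Step 2: pseudo-outcome tau* on the RCT, then least squares for theta.
x_r, a_r, y_r = simulate(20_000, randomized=True)
e = 0.5  # known RCT propensity score
tau_star = (a_r / e - (1 - a_r) / (1 - e)) * y_r
theta_hat = ols(linear_features(x_r), tau_star - tau_obs_hat(x_r))

# Step 3: bias-corrected CATE estimate.
def tau_hat(x):
    return tau_obs_hat(x) + linear_features(x) @ theta_hat

grid = np.linspace(-1.0, 1.0, 21)
naive_err = np.abs(tau_obs_hat(grid) - true_cate(grid))
corrected_err = np.abs(tau_hat(grid) - true_cate(grid))
print("mean abs error, observational only:", naive_err.mean())
print("mean abs error, bias-corrected:    ", corrected_err.mean())
```

In this design the bias is exactly linear in $(1, x),$ so the linear correction is well specified; the curvature of the CATE is learned from the larger observational sample, while the RCT is only asked to estimate the two coefficients of the correction.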

Deconfounding: Model the confounding

This idea comes from Yang et al. (2020). Unlike in the previous section, we use an indicator $S$ to distinguish the two data sources: $S=1$ denotes the RCT sample and $S=0$ the observational data.

Assumption 1 (Transportability and strong ignorability of trial treatment assignment).

  • $\mathbb{E}[Y(1)-Y(0)\vert X,S=1] = \mathbb{E}[Y(1) - Y(0)\vert X,S=0] = \tau(X),$ and
  • $Y(a)\perp A\ \vert\ (X,S=1)\ \text{for}\ a\in\lbrace 0,1\rbrace,$ and $e(X,S)>0$ almost surely.

We denote the CATE $\tau(\cdot)$ and confounding function by

$$\begin{align} \tau(x) &= \mathbb{E}[Y(1)-Y(0)|X=x],\\ \lambda(x) &= \mathbb{E}[Y(0)|A=1,X=x,S=0] - \mathbb{E}[Y(0)|A=0,X=x,S=0]. \end{align}$$

Both of them can be identified:

$$\begin{align} \tau(x) &= \mathbb{E}[Y|A=1,X=x,S=1] - \mathbb{E}[Y|A=0,X=x,S=1],\\ \lambda(x) &= \mathbb{E}[Y|A=1,X=x,S=0] - \mathbb{E}[Y|A=0,X=x,S=0] - \tau(x). \end{align}$$

We parameterize $\tau(\cdot)$ and $\lambda(\cdot)$ as follows:

Assumption 2 (Parametric structural model). The heterogeneous treatment effect and the confounding function are

$$\tau(x)=\tau_{\varphi_0}(x),\ \lambda(x)=\lambda_{\phi_0}(x),$$

where $\psi_0 = (\varphi_0^\top,\phi_0^\top)^\top\in\mathbb{R}^{p_1}\times\mathbb{R}^{p_2}.$

Remark. A crude estimator of $\psi_0$ can be obtained by least squares, since $\tau_{\varphi_0}(x)$ and $\lambda_{\phi_0}(x)$ are identified.
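As a rough illustration of such a crude estimator, the sketch below fits the two identification formulas by ordinary least squares on simulated data. The linear forms $\tau_\varphi(x)=\varphi^\top(1,x)$ and $\lambda_\phi(x)=\phi^\top(1,x),$ the data-generating process, and the helper names `simulate` and `arm_contrast` are all assumptions made for this example; they are not the estimator of Yang et al. (2020).

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, s):
    """Draw (X, A, Y) for the trial (s=1) or the observational sample (s=0)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    u = rng.normal(size=n)  # unmeasured confounder
    if s == 1:
        a = rng.binomial(1, 0.5, size=n)              # randomized assignment
    else:
        a = (u + rng.normal(size=n) > 0).astype(int)  # confounded by U
    y0 = x + u * (1.0 + 0.5 * x) + rng.normal(scale=0.5, size=n)
    y = y0 + a * (1.0 + 2.0 * x)                      # tau(x) = 1 + 2x
    return x, a, y

def arm_contrast(x, a, y):
    """OLS of Y on [1, x, A, A*x]; return the coefficients of (A, A*x),
    i.e. the fitted E[Y|A=1,X=x] - E[Y|A=0,X=x] as a linear function of x."""
    design = np.column_stack([np.ones_like(x), x, a, a * x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[2:]

# tau is the arm contrast in the trial (S = 1).
varphi_hat = arm_contrast(*simulate(50_000, s=1))
# lambda is the arm contrast in the observational sample (S = 0) minus tau.
phi_hat = arm_contrast(*simulate(50_000, s=0)) - varphi_hat

# In this design lambda(x) = (2/sqrt(pi)) * (1 + 0.5 x).
true_phi = np.array([2.0, 1.0]) / np.sqrt(np.pi)
print("varphi_hat:", varphi_hat, "(truth: [1, 2])")
print("phi_hat:   ", phi_hat, "(truth:", true_phi, ")")
```

This estimator uses the trial alone for $\varphi_0,$ so it does not borrow any strength from the observational sample when estimating the treatment effect.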

Note that

$$\begin{align} \mathbb{E}[Y-\tau(X)A|A,X,S] &= \mathbb{E}[Y(A)-A(Y(1)-Y(0))|A,X,S] = \mathbb{E}[Y(0)|A,X,S], \end{align}$$

that by Assumption 1

$$\mathbb{E}[Y(0)|A,X,S=1] - \mathbb{E}[Y(0)|X,S=1] = 0,$$

and that

$$\begin{align} &\mathbb{E}[Y(0)|A,X,S=0] - \mathbb{E}[Y(0)|X,S=0] \\ &= \mathbb{E}[Y(0)|A,X,S=0] - \mathbb{E}[Y(0)|A=1,X,S=0]\times e(X,0)\\ &\quad - \mathbb{E}[Y(0)|A=0,X,S=0]\times(1-e(X,0))\\ &= \left(A-e(X,0)\right)\left\lbrace\mathbb{E}[Y(0)|A=1,X,S=0]-\mathbb{E}[Y(0)|A=0,X,S=0]\right\rbrace\\ &= \lambda_{\phi_0}(X)\left(A-e(X,0)\right). \end{align}$$

Then, we introduce the following variable to mimic $Y(0):$

$$H_{\psi_0} = Y-\tau_{\varphi_0}(X)A - (1-S)\lambda_{\phi_0}(X)(A-e(X,S)).$$

Proposition 1. Under Assumptions 1 and 2, we have $\mathbb{E}[H_{\psi_0}\vert A,X,S] = \mathbb{E}[Y(0)\vert X,S].$
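Proposition 1 can be checked numerically on the observational sample ($S=0$), where the correction term is active. The simulation below is a hypothetical design chosen so that $\tau,$ $\lambda$ and the observational propensity score ($1/2$ by symmetry) are known in closed form; in practice all three would have to be estimated.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Observational sample (S = 0) with an unmeasured confounder U.
x = rng.uniform(-1.0, 1.0, size=n)
u = rng.normal(size=n)
a = (u + rng.normal(size=n) > 0).astype(int)  # P(A=1 | X, S=0) = 1/2
y0 = x + u * (1.0 + 0.5 * x) + rng.normal(scale=0.5, size=n)

def tau(x):
    return 1.0 + 2.0 * x

def lam(x):
    # E[U | A=1] - E[U | A=0] = 2/sqrt(pi) in this design.
    return (2.0 / np.sqrt(np.pi)) * (1.0 + 0.5 * x)

y = y0 + a * tau(x)
s = 0
e_obs = 0.5

h = y - tau(x) * a - (1 - s) * lam(x) * (a - e_obs)
uncorrected = y - tau(x) * a  # equals Y(0), still confounded with A

# Here E[Y(0) | X, S=0] = X, so H - X should average to zero in both arms.
gap_h = (h - x)[a == 1].mean() - (h - x)[a == 0].mean()
gap_uncorrected = (uncorrected - x)[a == 1].mean() - (uncorrected - x)[a == 0].mean()
print("arm gap without the lambda term:", gap_uncorrected)
print("arm gap of H:                   ", gap_h)
```

Without the $\lambda$ term the treated and control arms differ in their mean of $Y(0),$ which is exactly the hidden confounding; subtracting $\lambda_{\phi_0}(X)(A-e(X,0))$ removes that dependence on $A.$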

Furthermore, the semiparametric efficient score of $\psi_0$ can be derived. Solving the estimating equation based on this score yields a semiparametric efficient estimator of $\psi_0.$

References

  • Colnet, B., Mayer, I., Chen, G., Dieng, A., Li, R., Varoquaux, G., Vert, J.P., Josse, J. and Yang, S. (2020). Causal inference methods for combining randomized trials and observational studies: a review. arXiv preprint arXiv:2011.08047.
  • Kallus, N., Puli, A. M., and Shalit, U. (2018). Removing hidden confounding by experimental grounding. In Advances in Neural Information Processing Systems, pages 10888–10897.
  • Yang, S., Zeng, D., and Wang, X. (2020). Improved inference for heterogeneous treatment effects using real-world data subject to hidden confounding. arXiv preprint arXiv:2007.12922.
