Principles and practice of matching and weighting estimators for causal inference.
In the potential outcomes framework, for each unit $i$, we define:
The Individual Treatment Effect (ITE): \(\tau_i = Y_i(1) - Y_i(0)\) is a random variable. It is not observed since we only observe one outcome for each unit: \(Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)\)
Since we cannot calculate $\tau_i$ directly, we instead estimate the Average Treatment Effect (ATE): \(\tau_{ATE} = E[Y(1) - Y(0)]\)
In observational studies, where the treatment allocation probability $\pi = P(D=1)$ need not equal $0.5$, one might use the Naive Estimator: \(\hat{\delta}_{naive} = \hat{E}[Y | D=1] - \hat{E}[Y | D=0]\) which converges in probability to \(E[Y(1) | D=1] - E[Y(0) | D=0]\), a difference of conditional expectation parameters. The counterfactual framework has four key conditional expectations (plus the treatment probability $\pi$):
1. $E[Y(1) | D=1]$: Treated outcome for the treated.
2. $E[Y(0) | D=1]$: Untreated outcome for the treated (counterfactual).
3. $E[Y(1) | D=0]$: Treated outcome for the control (counterfactual).
4. $E[Y(0) | D=0]$: Untreated outcome for the control.
The naive estimator consistently estimates 1 and 4. The treatment probability is also consistently estimated ($\hat{E}[D] \xrightarrow{p} \pi$). So using the naive estimator means using these three quantities. The ATE can be expressed in terms of all five parameters as follows: \(\tau_{ATE} = \pi \underbrace{(E[Y(1)|D=1] - E[Y(0)|D=1])}_{\text{ATT}} + (1-\pi) \underbrace{(E[Y(1)|D=0] - E[Y(0)|D=0])}_{\text{ATU}}\) If we attempt to equate the Naive Estimator to the true ATE, the terms do not cancel out gracefully without some assumption: there are many possible values of 2 and 3 for which the equation does not hold, and we lack estimators for these two counterfactual terms.
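A minimal simulation makes the bias concrete. The data-generating process below is an illustrative assumption (a binary confounder `X` that raises both the treatment probability and the baseline outcome), not something specified in the text: the true ATE is 1, but the naive difference in means lands well above it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounder X drives both treatment assignment and outcomes.
X = rng.binomial(1, 0.5, size=n)
pi_x = np.where(X == 1, 0.8, 0.2)   # P(D=1 | X): treated units tend to have X=1
D = rng.binomial(1, pi_x)

# Potential outcomes: the true ITE is 1 for every unit, so the true ATE is 1.
Y0 = 2.0 * X + rng.normal(size=n)
Y1 = Y0 + 1.0
Y = np.where(D == 1, Y1, Y0)        # observed outcome: Y = D*Y(1) + (1-D)*Y(0)

naive = Y[D == 1].mean() - Y[D == 0].mean()
# Biased upward: the treated group over-represents X=1, which also raises
# the baseline outcome. Here the naive limit works out to about 2.2.
print(round(naive, 2))
```

The naive estimator is fine for 1 and 4 individually; the bias enters only when their difference is read as a causal effect.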
If we assume \(E[Y(1) | D=1] = E[Y(1) | D=0]\) \(E[Y(0) | D=1] = E[Y(0) | D=0]\) then the Naive Estimator is a valid estimator for the ATE (plug those into the expansion above to check this). This is quite a strong assumption. We can construct many situations where $D$ and $Y(0), Y(1)$ are both functions of $X$, and thus the assumption is violated.
So we want a milder assumption. For each value of $X$, there is a corresponding subpopulation. Let’s assume that $X$ is informative and granular enough that, within each subpopulation, treatment assignment is effectively random. This would hold, for example, if the treatment were assigned by a Bernoulli trial with probability $\pi(X)$.
Then, we only need to assume \(E[Y(1) | D=1, X] = E[Y(1) | D=0, X]\) \(E[Y(0) | D=1, X] = E[Y(0) | D=0, X].\)
Under these assumptions, we can consistently estimate the CATE: \(\tau_{CATE}(X) = E[Y(1) | X] - E[Y(0) | X]\) \(= E[Y(1) | D=1, X] - E[Y(0) | D=0, X]\) \(= E[Y(1) | D=1, X] - E[Y(0) | D=1, X]\) \(= E[Y | D=1, X] - E[Y | D=0, X]\) which we estimate by the conditional sample mean difference.
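Continuing the illustrative setup from before (a binary confounder `X`, true effect 1; the numbers are assumptions for demonstration), the conditional sample mean difference within each stratum of $X$ recovers the true effect even though the pooled naive estimator does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.binomial(1, 0.5, size=n)
D = rng.binomial(1, np.where(X == 1, 0.8, 0.2))  # treatment is Bernoulli(pi(X))
Y0 = 2.0 * X + rng.normal(size=n)
Y = np.where(D == 1, Y0 + 1.0, Y0)               # true CATE is 1 in every stratum

# Conditional sample mean difference within each stratum of X.
cate = {
    x: Y[(D == 1) & (X == x)].mean() - Y[(D == 0) & (X == x)].mean()
    for x in (0, 1)
}
# Both stratum-level estimates are close to the true effect of 1.
print(cate)
```

Within a stratum, treatment is as good as randomized (by assumption), so the within-stratum difference in means is unconfounded.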
By the law of iterated expectation, the ATE is \(\tau_{ATE} = E[\tau_{CATE}(X)]\) Since the CATE is a function of $X$, the expectation above is taken over $X$ with probability density function $f(x)$. Since we believe each $X_i$ is drawn from $f$, the plug-in (empirical measure) version is just the sample mean of the CATE estimates over the $X_i$ in the dataset: \(\hat{\tau}_{ATE} = \frac{1}{n} \sum_{i=1}^n \hat{\tau}_{CATE}(X_i)\)
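The aggregation step can be sketched on the same illustrative data (binary confounder, true ATE of 1; all numbers are assumptions): average the stratum-level CATE estimates over the empirical distribution of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.binomial(1, 0.5, size=n)
D = rng.binomial(1, np.where(X == 1, 0.8, 0.2))
Y0 = 2.0 * X + rng.normal(size=n)
Y = np.where(D == 1, Y0 + 1.0, Y0)

# Stratum-level CATE estimates, indexed by the value of X.
cate = np.array([
    Y[(D == 1) & (X == x)].mean() - Y[(D == 0) & (X == x)].mean()
    for x in (0, 1)
])

# Plug-in ATE: the sample mean of cate(X_i) over all units in the dataset.
tau_hat = cate[X].mean()
print(round(tau_hat, 2))  # close to the true ATE of 1
```

Because $X$ is split roughly 50/50 here, this is close to an equal-weight average of the two stratum estimates; in general the strata are weighted by their empirical frequencies.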
This works as long as we can compute $\hat{\tau}_{CATE}(X_i)$ for every $X_i$ in our dataset. That is the basic idea behind matching: estimate the effect within each covariate value, then aggregate. Divide and conquer.