Principles and practice of matching and weighting estimators for causal inference.
In the potential outcomes framework, for each unit $i$, we define:
The Individual Treatment Effect (ITE): \(\tau_i = Y_i(1) - Y_i(0)\) is a random variable. It is not observed since we only observe one outcome for each unit: \(Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)\)
Since we cannot calculate $\tau_i$ directly, we instead estimate the Average Treatment Effect (ATE): \(\tau_{ATE} = E[Y(1) - Y(0)]\)
In observational studies, where the treatment allocation probability $\pi = P(D=1)$ need not equal $0.5$, one might use the Naive Estimator: \(\hat{\delta}_{naive} = \hat{E}[Y | D=1] - \hat{E}[Y | D=0]\) which converges in probability to \(E[Y(1) | D=1] - E[Y(0) | D=0]\), a difference of conditional expectation parameters. The counterfactual framework has four key conditional expectations (plus the treatment probability $\pi$):
1. $E[Y(1) | D=1]$: Treated outcome for the treated.
2. $E[Y(0) | D=1]$: Untreated outcome for the treated (counterfactual).
3. $E[Y(1) | D=0]$: Treated outcome for the control (counterfactual).
4. $E[Y(0) | D=0]$: Untreated outcome for the control.
The naive estimator consistently estimates 1 and 4. The treatment probability is also consistently estimated ($\hat{E}[D] \xrightarrow{p} \pi$). So using the naive estimator means using these three quantities. The ATE can be expressed in terms of all five parameters as follows: \(\tau_{ATE} = \pi \underbrace{(E[Y(1)|D=1] - E[Y(0)|D=1])}_{\text{ATT}} + (1-\pi) \underbrace{(E[Y(1)|D=0] - E[Y(0)|D=0])}_{\text{ATU}}\) If we attempt to equate the Naive Estimator to the true ATE, the terms do not cancel out gracefully without some assumption: there are many possible values of 2 and 3 for which the equation does not hold, and we lack estimators for these two counterfactual terms.
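A minimal simulation makes the bias concrete. The data-generating process below is an illustrative assumption (a binary confounder `X` that raises both the treatment probability and the baseline outcome), not something specified in the text: the true ATE is 1, but the naive difference in means lands well above it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounder X drives both treatment assignment and outcomes.
X = rng.binomial(1, 0.5, size=n)
pi_x = np.where(X == 1, 0.8, 0.2)   # P(D=1 | X): treated units tend to have X=1
D = rng.binomial(1, pi_x)

# Potential outcomes: the true ITE is 1 for every unit, so the true ATE is 1.
Y0 = 2.0 * X + rng.normal(size=n)
Y1 = Y0 + 1.0
Y = np.where(D == 1, Y1, Y0)        # observed outcome: Y = D*Y(1) + (1-D)*Y(0)

naive = Y[D == 1].mean() - Y[D == 0].mean()
# Biased upward: the treated group over-represents X=1, which also raises
# the baseline outcome. Here the naive limit works out to about 2.2.
print(round(naive, 2))
```

The naive estimator is fine for 1 and 4 individually; the bias enters only when their difference is read as a causal effect.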
If we assume \(E[Y(1) | D=1] = E[Y(1) | D=0]\) \(E[Y(0) | D=1] = E[Y(0) | D=0]\) then the Naive Estimator is a valid estimator for the ATE (plug those into the expansion above to check this). This is quite a strong assumption. We can construct many situations where $D$ and $Y(0), Y(1)$ are both functions of $X$, and thus the assumption is violated.
So we want a milder assumption. For each value of $X$, there is a corresponding subpopulation. Let’s assume that $X$ is informative and granular enough that, within each subpopulation, treatment assignment is effectively random. This would hold, for example, if the treatment were assigned by a Bernoulli trial with probability $\pi(X)$.
Then, we only need to assume \(E[Y(1) | D=1, X] = E[Y(1) | D=0, X]\) \(E[Y(0) | D=1, X] = E[Y(0) | D=0, X].\)
Under these assumptions, we can consistently estimate the CATE: \(\tau_{CATE}(X) = E[Y(1) | X] - E[Y(0) | X]\) \(= E[Y(1) | D=1, X] - E[Y(0) | D=0, X]\) \(= E[Y(1) | D=1, X] - E[Y(0) | D=1, X]\) \(= E[Y | D=1, X] - E[Y | D=0, X]\) which we estimate by the conditional sample mean difference.
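Continuing the illustrative setup from before (a binary confounder `X`, true effect 1; the numbers are assumptions for demonstration), the conditional sample mean difference within each stratum of $X$ recovers the true effect even though the pooled naive estimator does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.binomial(1, 0.5, size=n)
D = rng.binomial(1, np.where(X == 1, 0.8, 0.2))  # treatment is Bernoulli(pi(X))
Y0 = 2.0 * X + rng.normal(size=n)
Y = np.where(D == 1, Y0 + 1.0, Y0)               # true CATE is 1 in every stratum

# Conditional sample mean difference within each stratum of X.
cate = {
    x: Y[(D == 1) & (X == x)].mean() - Y[(D == 0) & (X == x)].mean()
    for x in (0, 1)
}
# Both stratum-level estimates are close to the true effect of 1.
print(cate)
```

Within a stratum, treatment is as good as randomized (by assumption), so the within-stratum difference in means is unconfounded.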
By the law of iterated expectation, the ATE is \(\tau_{ATE} = E[\tau_{CATE}(X)]\) Since the CATE is a function of $X$, the expectation above is taken over $X$ with probability density function $f(x)$. Since we believe each $X_i$ is drawn from $f$, the plug-in (empirical measure) version is just the sample mean of the CATE estimates over the $X_i$ in the dataset: \(\hat{\tau}_{ATE} = \frac{1}{n} \sum_{i=1}^n \hat{\tau}_{CATE}(X_i)\)
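The aggregation step can be sketched on the same illustrative data (binary confounder, true ATE of 1; all numbers are assumptions): average the stratum-level CATE estimates over the empirical distribution of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.binomial(1, 0.5, size=n)
D = rng.binomial(1, np.where(X == 1, 0.8, 0.2))
Y0 = 2.0 * X + rng.normal(size=n)
Y = np.where(D == 1, Y0 + 1.0, Y0)

# Stratum-level CATE estimates, indexed by the value of X.
cate = np.array([
    Y[(D == 1) & (X == x)].mean() - Y[(D == 0) & (X == x)].mean()
    for x in (0, 1)
])

# Plug-in ATE: the sample mean of cate(X_i) over all units in the dataset.
tau_hat = cate[X].mean()
print(round(tau_hat, 2))  # close to the true ATE of 1
```

Because $X$ is split roughly 50/50 here, this is close to an equal-weight average of the two stratum estimates; in general the strata are weighted by their empirical frequencies.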
This works as long as we can compute $\hat{\tau}_{CATE}(X_i)$ for every $X_i$ in our dataset. That is the basic idea behind matching: estimate the effect within each covariate value, then aggregate. Divide and conquer.