An intuitive derivation of the Inverse Probability Weighting (IPW) estimator from a single RCT to multiple RCTs and observational data.
This post shows that somewhat intimitating form of IPW estimator
$$ \begin{equation*} \hat{\tau}_{IPW} = \frac{1}{n} \sum_{i=1}^n \left( \frac{W_i Y_i}{\hat{e}(X_i)} - \frac{(1 - W_i) Y_i}{1 - \hat{e}(X_i)} \right) \end{equation*} $$
is actually just a weighted average of many mean differences.
$$ \begin{align*} \hat{\tau} &= \left( \text{Average of Treated} \right) - \left( \text{Average of Control} \right)\\&= \frac{1}{n_1} \sum_{W_i=1} Y_i - \frac{1}{n_0} \sum_{W_i=0} Y_i \end{align*} $$
Derivation to IPW form: First, rewrite the sums over subsets as sums over the entire sample \(n\) by using \(W_i\) and \((1 - W_i)\) to zero out the irrelevant terms:
$$ \begin{equation*} \hat{\tau} = \sum_{i=1}^n \frac{W_i Y_i}{n_1} - \sum_{i=1}^n \frac{(1 - W_i) Y_i}{n_0} \end{equation*} $$
Next, define the constant propensity score \(e = \frac{n_1}{n}\). This implies:
$$ \begin{align*} n_1 &= n \cdot e\\ n_0 &= n \cdot (1 - e) \end{align*} $$
Substitute these into the denominators:
$$ \begin{equation*} \hat{\tau} = \sum_{i=1}^n \frac{W_i Y_i}{n \cdot e} - \sum_{i=1}^n \frac{(1 - W_i) Y_i}{n \cdot (1 - e)} \end{equation*} $$
Finally, factor out \(\frac{1}{n}\) to reveal the IPW form:
$$ \begin{equation*} \hat{\tau} = \frac{1}{n} \sum_{i=1}^n \left( \frac{W_i Y_i}{e} - \frac{(1 - W_i) Y_i}{1 - e} \right) \end{equation*} $$
Let’s say we have two separate RCTs, one in City A and one in City B. The intuitive approach is to calculate the ATE for each city separately, and then take a weighted average based on their population sizes.
$$ \begin{equation*} \hat{\tau}_\{\mathrm{AGG}\} = \frac{n_A}{n} \hat{\tau}_A + \frac{n_B}{n} \hat{\tau}_B \end{equation*} $$
This is the sample version of:
$$ \begin{equation*} \tau = \mathbb{E}[\tau(X)\,|\,X] = \sum P(X) \tau(X) \end{equation*} $$
Derivation to IPW form: Using the logic from Step 1, we can write the ATE for City A (\(\hat{\tau}_A\)) in its local IPW form, where \(\hat{e}_A\) is the specific treatment probability in City A:
$$ \begin{equation*} \hat{\tau}_A = \frac{1}{n_A} \sum_{i \in A} \left( \frac{W_i Y_i}{\hat{e}_A} - \frac{(1 - W_i) Y_i}{1 - \hat{e}_A} \right) \end{equation*} $$
Now, plug this back into the intuitive aggregation formula. Notice how the city population size $n_A$ perfectly cancels out:
$$ \begin{align*} \frac{n_A}{n} \hat{\tau}_A &= \frac{n_A}{n} \left[ \frac{1}{n_A} \sum_{i \in A} \left( \frac{W_i Y_i}{\hat{e}_A} - \frac{(1 - W_i) Y_i}{1 - \hat{e}_A} \right) \right] \\&= \frac{1}{n} \sum_{i \in A} \left( \frac{W_i Y_i}{\hat{e}_A} - \frac{(1 - W_i) Y_i}{1 - \hat{e}_A} \right) \end{align*} $$
When you do this for both City A and City B and add them together, summing over City A plus summing over City B is exactly the same as summing over the entire dataset \(n\):
$$ \begin{align*} \hat{\tau}_\{\mathrm{AGG}\} &= \frac{1}{n} \sum_{i=1}^n \left( \frac{W_i Y_i}{\hat{e}(C_i)} - \frac{(1 - W_i) Y_i}{1 - \hat{e}(C_i)} \right) \end{align*} $$
This is the exact same mathematical mechanism as the Two RCT case, generalized to $p$ finite groups defined by covariates $X_i = x$:
$$ \begin{equation*} \hat{\tau}_\{\mathrm{AGG}\} = \sum_{x \in \mathcal{X}} \frac{n_x}{n} \hat{\tau}(x) \end{equation*} $$
Where the group-specific ATE is:
$$ \begin{equation*} \hat{\tau}(x) = \frac{1}{n_{x,1}} \sum_{X_i=x, W_i=1} Y_i - \frac{1}{n_{x,0}} \sum_{X_i=x, W_i=0} Y_i \end{equation*} $$
Derivation to IPW form: First, express the denominators $n_{x,1}$ and $n_{x,0}$ using the group-specific propensity score $\hat{e}(x) = \frac{n_{x,1}}{n_x}$:
$$ \begin{align*} n_{x,1} &= n_x \hat{e}(x) \\ n_{x,0} &= n_x (1 - \hat{e}(x)) \end{align*} $$
Rewrite $\hat{\tau}(x)$ using the $W_i$ filter trick to sum over everyone in group $x$:
$$ \begin{align*} \hat{\tau}(x) &= \sum_{i: X_i=x} \left( \frac{W_i Y_i}{n_x \hat{e}(x)} - \frac{(1 - W_i) Y_i}{n_x (1 - \hat{e}(x))} \right) \\ &= \frac{1}{n_x} \sum_{i: X_i=x} \left( \frac{W_i Y_i}{\hat{e}(x)} - \frac{(1 - W_i) Y_i}{1 - \hat{e}(x)} \right) \end{align*} $$
Now, plug this \(\hat{\tau}(x)\) back into the outer aggregation sum $\hat{\tau}_{\mathrm{AGG}}$. Once again, the subgroup sizes $n_x$ cancel out:
$$ \begin{align*} \hat{\tau}_\{\mathrm{AGG}\} &= \sum_{x \in \mathcal{X}} \frac{n_x}{n} \left[ \frac{1}{n_x} \sum_{i: X_i=x} \left( \frac{W_i Y_i}{\hat{e}(x)} - \frac{(1 - W_i) Y_i}{1 - \hat{e}(x)} \right) \right] \\ &= \sum_{x \in \mathcal{X}} \frac{1}{n} \sum_{i: X_i=x} \left( \frac{W_i Y_i}{\hat{e}(X_i)} - \frac{(1 - W_i) Y_i}{1 - \hat{e}(X_i)} \right) \end{align*} $$
Summing across all discrete groups $x$ is equivalent to a single sum over all $n$ individuals in the dataset, leaving us with the final, generalized formula:
$$ \begin{equation*} \hat{\tau}_\{\mathrm{AGG}\} = \frac{1}{n} \sum_{i=1}^n \left( \frac{W_i Y_i}{\hat{e}(X_i)} - \frac{(1 - W_i) Y_i}{1 - \hat{e}(X_i)} \right) \end{equation*} $$
Here we just convince ourselves that the same logic applies to continuous covariates. The only difference is that we can no longer calculate the empirical fraction $\hat{e}(X_i)$ because the group size $n_x$ often drops to 1 or 0. We just replace the empirical fraction $\hat{e}(X_i)$ with a modeled probability from a continuous function (like logistic regression):
$$ \begin{equation*} \hat{\tau}_{IPW} = \frac{1}{n} \sum_{i=1}^n \left( \frac{W_i Y_i}{\hat{e}(X_i)} - \frac{(1 - W_i) Y_i}{1 - \hat{e}(X_i)} \right) \end{equation*} $$
We have to make sure that we have RCT for each covariate fixed. This means that we need to have ignorability and overlap assumptions, just as in \href{https://jong-min.org/blog/2025/when-does-linear-regression-yield-causal-inference/}{causal inference by linear regression}. The difference is that we no longer need the well specified linear model assumption