When standard experiments fail: utilizing Synthetic Controls, managing experimentation culture, and understanding various treatment effects.
While randomized controlled experiments (A/B testing) remain the gold standard for drawing causal inferences, there are many scenarios where they are deeply flawed or impossible to implement.
Standard experimental techniques—ranging from individual-level A/B testing to spatial/cluster randomization or switchback experiments—become unworkable when:
In these cases, we rely on causal inference techniques designed for observational data. Arguably the most important innovation in policy evaluation in recent decades is Synthetic Control.
Uber launched in the US as a card-only service but later expanded into cash-heavy markets like Latin America and India. While accepting cash unlocked new rider segments, it introduced operational friction—specifically, drivers having to carry change and Uber struggling to collect its commissions.
To evaluate this, Uber proposed an experiment: show drivers the payment method upfront and measure the impact on trip acceptance rates and unpaid service fees.
However, standard A/B testing fails here due to network interference (the spillover effect). If drivers in the treatment group prefer cash and systematically decline card trips, they consume the supply of cash trips. Consequently, control drivers, who cannot see payment types, are starved of cash trips; their outcomes shift even though they never received the treatment, skewing the comparison.
How about a switchback experiment? We can fix a city and switch it back and forth between treatment and control over different time intervals. See here for examples from DoorDash’s algorithm-change experiment and Lyft’s surge-pricing subsidy experiment. Note that those features are not user-facing, so they can be silently deployed.
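As a rough sketch of the mechanics (the function name, salt, and window size below are hypothetical, not DoorDash’s or Lyft’s actual code), a switchback assigns whole (city, time-window) buckets to a variant rather than individual users:

```python
import hashlib
from datetime import datetime

def switchback_assignment(city: str, ts: datetime, window_minutes: int = 60,
                          salt: str = "surge_v2") -> str:
    """Assign an entire (city, time-window) bucket to treatment or control."""
    # Index of the time window since the epoch (e.g., hour-long buckets).
    window_index = int(ts.timestamp()) // (window_minutes * 60)
    # Hashing (salt, city, window) keeps the assignment stable and reproducible,
    # while flipping pseudo-randomly from one window to the next.
    digest = int(hashlib.md5(f"{salt}:{city}:{window_index}".encode()).hexdigest(), 16)
    return "treatment" if digest % 2 == 0 else "control"

# The same city alternates between variants across consecutive hours:
for hour in range(4):
    print(switchback_assignment("san_francisco", datetime(2024, 1, 1, hour)))
```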
Cash trips, however, are a user-facing feature. Under a switchback, the same driver would land in both the control and treatment conditions across different time buckets and would notice the change, so we cannot use a switchback experiment for the cash-trip test. Even for algorithmic changes, users are increasingly likely to notice the switching, which makes switchback experiments harder to run; for example, Nick Jones at Uber mentioned that they could not use a switchback for their surge-pricing algorithm change.
Synthetic control “is arguably the most important innovation in the policy evaluation literature in the last 15 years” (Athey & Imbens, 2017).
To measure the launch’s impact, we need to answer the counterfactual question: “What would have happened to Miami if we hadn’t launched the feature?” We don’t have that information; what we do have is:
Instead of picking one flawed control city, Synthetic Control constructs a “Synthetic Miami”: a weighted average of several untreated cities (the “donor pool”), with the weights chosen so that the synthetic city tracks real Miami’s metric as closely as possible before the launch.
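Below is a minimal sketch of that weighting step using toy data and scipy, not Uber’s actual pipeline: the donor weights are constrained to be non-negative and sum to one, chosen to minimize the pre-launch gap between real and synthetic Miami.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
pre_periods, post_periods, n_donors = 30, 10, 8

# Toy data: rows are weeks, columns are donor cities; Miami is the treated unit.
X_pre = rng.normal(100, 5, size=(pre_periods, n_donors))
y_miami_pre = X_pre[:, :3].mean(axis=1) + rng.normal(0, 1, pre_periods)

def pre_launch_gap(w):
    """Squared distance between real Miami and the weighted donor average, pre-launch."""
    return np.sum((y_miami_pre - X_pre @ w) ** 2)

# Weights live on the simplex: non-negative and summing to one.
w0 = np.full(n_donors, 1 / n_donors)
res = minimize(pre_launch_gap, w0,
               bounds=[(0, 1)] * n_donors,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}])
weights = res.x

# "Synthetic Miami" after the launch is the same weighted average of the donors'
# post-launch metric; the estimated effect is observed Miami minus this counterfactual.
X_post = rng.normal(100, 5, size=(post_periods, n_donors))
synthetic_miami_post = X_post @ weights
```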
When running experiments in the tech industry, the goal fundamentally shifts away from purely “scientific” rigidity.
In a traditional scientific approach, experiments are sized (via Power Analysis) to control false positives while reliably detecting true effects: you set a sample size $N$ optimized to detect an effect of size $X$ with statistical significance, and you run the test to completion.
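As a concrete illustration of that fixed-sample sizing, here is what the calculation might look like with statsmodels; the effect size, power, and significance level are made-up inputs, not recommendations.

```python
from statsmodels.stats.power import NormalIndPower

# Solve for the per-arm sample size N needed to detect a standardized effect of
# 0.05 with 80% power at a 5% two-sided significance level.
n_per_arm = NormalIndPower().solve_power(
    effect_size=0.05,          # the effect X we care about, in standardized units
    alpha=0.05,                # acceptable false-positive rate
    power=0.80,                # probability of detecting the effect if it is real
    ratio=1.0,                 # equally sized treatment and control arms
    alternative="two-sided",
)
print(round(n_per_arm))        # roughly 6,280 users per arm for these inputs
```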
The Problem: This wastes time and sample. Tech companies have to prioritize velocity: they have an endless backlog of features to test, and the limiting resource is experimentation bandwidth, i.e., available traffic and sample size.
The Solution: Discovery-driven or Adaptive Experimentation. We “peek” at the results smartly using upper and lower dynamic thresholds. If a product is clearly amazing early on, we stop the experiment, declare victory, and ship it. If it’s clearly a dud or causing harm, we shut it down immediately to save sample size for the next idea.
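As a rough illustration of a single “peek” (the thresholds below are placeholders, not a rigorous sequential design such as mSPRT or group-sequential boundaries), the decision logic might look like this:

```python
import numpy as np
from scipy import stats

def peek(treatment: np.ndarray, control: np.ndarray,
         upper_z: float = 3.0, lower_z: float = -2.0) -> str:
    """One 'peek': stop early for a clear win or a clear loss, otherwise keep going."""
    z = stats.ttest_ind(treatment, control).statistic  # ~ z-statistic for large samples
    if z > upper_z:
        return "stop: ship it"        # clearly positive, free up traffic for the next idea
    if z < lower_z:
        return "stop: shut it down"   # clearly a dud or harmful
    return "continue"                 # not decisive yet

rng = np.random.default_rng(1)
treatment = rng.normal(0.12, 1.0, 2_000)   # simulated metric with a small real lift
control = rng.normal(0.10, 1.0, 2_000)
print(peek(treatment, control))
```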
Standard A/B tests isolate single features, making it impossible to answer: “What is the aggregate effect of everything we shipped this quarter on long-term retention?”
To measure long-term, multi-feature impact, companies utilize a Universal Holdout. At the beginning of a quarter (or year), a small set of users is completely held back from receiving any new products or experiments. At the end of the time frame, analyzing the difference between this universal control group and the general population reveals the cumulative impact of the entire product roadmap.
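A minimal sketch of how this might be wired up, using hypothetical helpers rather than any real platform’s API: a small, stable slice of users is deterministically hashed into the holdout at the start of the quarter, and at the end we compare their retention to everyone else’s.

```python
import hashlib

HOLDOUT_PCT = 2  # e.g., hold back 2% of users for the whole quarter

def in_universal_holdout(user_id: str, quarter: str = "2024Q1") -> bool:
    """Deterministically keep a small, stable slice of users on the status-quo experience."""
    digest = int(hashlib.md5(f"{quarter}:{user_id}".encode()).hexdigest(), 16)
    return digest % 100 < HOLDOUT_PCT

def cumulative_lift(retention_by_user: dict) -> float:
    """At quarter end: difference in retention between everyone else and the holdout."""
    holdout = [r for u, r in retention_by_user.items() if in_universal_holdout(u)]
    exposed = [r for u, r in retention_by_user.items() if not in_universal_holdout(u)]
    return sum(exposed) / len(exposed) - sum(holdout) / len(holdout)
```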
When does an experiment cross ethical boundaries?
Depending on how treatment was randomized and whether subjects complied with their assignment, causal evaluation yields different definitions of “Effect”:
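In standard potential-outcomes notation (outcomes $Y(1)$ and $Y(0)$ with and without treatment, $D$ for the treatment actually received, $Z$ for the random assignment), the usual candidates are:

$$
\begin{aligned}
\text{ATE} &= \mathbb{E}\left[\,Y(1) - Y(0)\,\right] \\
\text{ATT} &= \mathbb{E}\left[\,Y(1) - Y(0) \mid D = 1\,\right] \\
\text{ITT} &= \mathbb{E}\left[\,Y \mid Z = 1\,\right] - \mathbb{E}\left[\,Y \mid Z = 0\,\right] \\
\text{LATE} &= \frac{\mathbb{E}[Y \mid Z = 1] - \mathbb{E}[Y \mid Z = 0]}{\mathbb{E}[D \mid Z = 1] - \mathbb{E}[D \mid Z = 0]}
\end{aligned}
$$

With perfect randomization and full compliance these coincide; under non-compliance, the ITT measures the effect of being offered the treatment, while the LATE (estimated via instrumental variables) recovers the effect on compliers.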