Violation of SUTVA in A/B Testing: Network Interference

A summary of the Lyft Engineering blog post 'Interference Across a Network' detailing how naive A/B testing can bias effect estimates in ridesharing.

Network interference in A/B testing

The Problem: Network Dynamics and Pricing

Network dynamics from supply and demand elasticity

Causal estimand

\begin{equation*} \tau := E[Y_A + Y_B | W_A=1, W_B=1] - E[Y_A + Y_B | W_A=0, W_B=0] \end{equation*}

Rider behavior model

Consider two riders, A and B, competing for a single available driver. Without a subsidy (i.e., under surge pricing), each rider requests a ride with probability 0.5; with a subsidy, a rider always requests. If both riders request, the driver serves each with probability 0.5, and at most one ride completes.

If there were parallel universes in which we could apply each treatment to the whole network:

Global Control (No Subsidy for Both)

\begin{equation*} E[Y_A + Y_B | W_A=0, W_B=0] = 0.5*0.75 + 0.5*0.75 = 0.75 \end{equation*}

Global Treatment (Subsidy for All)

\begin{equation*} E[Y_A + Y_B | W_A=1, W_B=1] = 0.5 + 0.5 = 1 \end{equation*}

The true causal treatment effect of the subsidy is an increase in completed rides from 0.75 to 1.0, a 33% increase.
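As a check on the arithmetic above, here is a minimal Monte Carlo of the two-rider, one-driver model (the function and trial count are illustrative, not from the original post):

```python
import random

def expected_rides(p_request_a, p_request_b, n_trials=200_000, seed=0):
    """Estimate E[Y_A + Y_B]: each rider requests with the given
    probability; only one driver is available, so at most one ride
    completes, with simultaneous requests broken by a fair coin."""
    rng = random.Random(seed)
    rides = 0
    for _ in range(n_trials):
        a = rng.random() < p_request_a  # does rider A request?
        b = rng.random() < p_request_b  # does rider B request?
        if a or b:
            rides += 1  # exactly one ride completes if anyone requests
    return rides / n_trials

print(expected_rides(0.5, 0.5))  # global control: ~0.75
print(expected_rides(1.0, 1.0))  # global treatment: 1.0
```

With 200,000 trials the control estimate lands within about a percent of the analytic value 0.75, and the treatment value is exactly 1 since both riders always request.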

Naive A/B Testing: Randomizing Users

User A (Control; no subsidy; surge priced)

\begin{equation*} E[Y_A | W_A=0, W_B=1] = 0.5*0.5 = 0.25 \end{equation*}

User B (Treatment; subsidy; not surge priced)

\begin{equation*} E[Y_B | W_A=0, W_B=1] = 0.5*1 + 0.5*0.5 = 0.75 \end{equation*}

Estimated Treatment Effect

\begin{equation*} \hat{\tau} = \frac{0.75 - 0.25}{0.25} = 2 \end{equation*}

The naive A/B test estimates a 200% increase in rides, overestimating the true global effect (33%) by a factor of six!
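The same toy model reproduces the naive estimate. In this hypothetical simulation, rider A is in control (requests with probability 0.5 under surge) and rider B is treated (subsidized, always requests), and we track which rider wins the single driver:

```python
import random

def naive_ab_estimate(n_trials=200_000, seed=1):
    """Return the naive relative lift (Y_B - Y_A) / Y_A when rider A is
    in control and rider B is treated, sharing one driver."""
    rng = random.Random(seed)
    rides_a = rides_b = 0
    for _ in range(n_trials):
        a = rng.random() < 0.5  # control rider requests half the time
        b = True                # treated rider always requests
        if a and b:
            # both want the one driver; a fair coin decides who rides
            if rng.random() < 0.5:
                rides_a += 1
            else:
                rides_b += 1
        elif a:
            rides_a += 1
        else:
            rides_b += 1
    y_a = rides_a / n_trials  # ~0.25
    y_b = rides_b / n_trials  # ~0.75
    return (y_b - y_a) / y_a  # ~2.0, i.e., a spurious +200%

print(naive_ab_estimate())
```

The inflation comes entirely from interference: subsidized rider B takes the shared driver away from control rider A, so the control outcome is depressed at the same time the treatment outcome is boosted.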

Statistical Interference

Upward interference bias: treated and control units compete for a shared resource (here, drivers), so treating one group depresses outcomes in the other and inflates the estimated effect, as in the subsidy example above.

Downward interference bias: benefits of the treatment spill over into the control group (for example, a treatment that attracts extra drivers also improves service for control users), shrinking the measured difference between groups.

Remedies

| Randomization unit | Bias | Variance |
| --- | --- | --- |
| Space (geohash) | High | Low |
| Time interval (hour) | Mid | Mid |
| Coarse spatial units (city) | Low | High |

In its earlier days, Lyft dealt with this by randomizing time intervals (alternating the network between a global treatment period and a global control period) to combat bias despite the higher variance.
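A time-interval design can be sketched as a switchback schedule. The helper below is a hypothetical illustration of the idea, not Lyft's implementation:

```python
import random

def switchback_schedule(n_intervals, seed=42):
    """Randomly assign each time interval (e.g., each hour) to global
    treatment (1) or global control (0). Within a treated interval the
    whole network receives the treatment, so riders never compete
    across arms at the same time."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n_intervals)]

# One week of hourly intervals; the analysis then compares metrics in
# treated hours against control hours. The effective sample size is the
# number of intervals (168 here), not the number of users -- hence the
# higher variance noted in the table above.
schedule = switchback_schedule(24 * 7)
```

Because assignment switches the entire network at once, no interval mixes treated and control riders, which is what removes the interference bias.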

Remedies in practice (actually, simulations)

We run A/B testing with three different randomization units:

Metrics

We focus on three key metrics for this simulation study. In all cases we measure percent changes in the metric in the treatment group relative to the control group.

  1. Availability: defined as the proportion of user app opens for which there is an available Lyft driver within some context-dependent radius. Since we are handling undersupply, this is an important metric.
  2. ETA: the average (across user sessions) estimated time of arrival (ETA) of the nearest available driver. ETA is one measure of the quality of Lyft’s service levels.
  3. Rides: the number of completed Lyft rides, normalized by group size. Rides is Lyft’s most important top-line business metric, but in a simulation setting is somewhat dependent on the models of passenger and driver behavior.
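These metrics are straightforward to compute from session logs. A minimal sketch, under an assumed session schema (the field names `drivers_in_radius` and `nearest_eta_min` are made up for illustration):

```python
def availability(sessions):
    """Share of app-open sessions with at least one driver in radius."""
    return sum(1 for s in sessions if s["drivers_in_radius"] > 0) / len(sessions)

def mean_eta(sessions):
    """Average ETA (minutes) of the nearest driver, over sessions
    where at least one driver was available."""
    etas = [s["nearest_eta_min"] for s in sessions if s["drivers_in_radius"] > 0]
    return sum(etas) / len(etas)

def pct_change(treatment_value, control_value):
    """Percent change of a metric in treatment relative to control,
    as used for all three metrics in the study."""
    return 100.0 * (treatment_value - control_value) / control_value
```

For example, if availability is 0.92 in treatment and 0.88 in control, `pct_change(0.92, 0.88)` reports roughly a +4.5% change.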

Ground truth

To get the ground truth values, we run the simulation over the entire period twice: once with the treatment applied to the whole network (global treatment) and once with it applied nowhere (global control).

As a result, we observe the following ground truths about the treatment:

Results

Lyft granularity diagram

References

  1. Lyft Engineering, "Interference Across a Network".