Switchback Experiments

An overview of switchback (time-split) experiments: what they are, why they are used to solve network interference, and their trade-offs.

Why Use Switchback Experiments?

What is a Switchback Experiment?

How to conduct a test

Bucket level summarization approach

\begin{equation*} X_i = \{\text{region}_i, \text{time}_i, \text{week number (seasonality)}, \text{total order value}, \ldots\} \end{equation*}
So that we can assume ignorability:
\begin{equation*} (Y_i(0), Y_i(1)) \perp \text{Bucket}_i \mid X_i \end{equation*}

TBD.

How to plan

The core of planning is determining the experimental unit and how many units are needed.

  1. Define the Region: Choose a specific geographical boundary (e.g., Manhattan, San Francisco).
  2. Define the Geographic Granularity (metro, zone, store).
    • Pick a granularity at which neighboring regions interact as little as possible.
  3. Define the Granularity (Time Window): Partition the timeline into discrete blocks (e.g., 30, 60, or 120 minutes). Choose the block length based on experience with carryover effects.
  4. Determine the Duration: Decide how long to run the switchback experiment, which sets the sample size.
  5. Random Assignment: Assign each window to Treatment or Control using a randomization algorithm. Many options exist (paired, unpaired, …), often alternating or using Markov chains to maintain balance.
  6. Execute:
    • 12:00 PM - 12:30 PM: Control (Standard dispatch algorithm for all users).
    • 12:30 PM - 1:00 PM: Treatment (New dispatch algorithm for all users).
    • 1:00 PM - 1:30 PM: Treatment …
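
The planning steps above can be sketched in code. The following is a minimal sketch, not a definitive implementation: it assumes simple independent randomization (a fair coin flip per window), which is only one of the options listed in step 5. Paired or Markov-chain schemes would need different logic. The function name `build_schedule` and its parameters are hypothetical.

```python
import random
from datetime import datetime, timedelta


def build_schedule(regions, start, end, window_minutes, seed=0):
    """Assign every (region, time window) pair to an arm.

    Assumption: each window gets an independent fair coin flip.
    """
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    step = timedelta(minutes=window_minutes)
    schedule = []
    for region in regions:
        window_start = start
        while window_start < end:
            schedule.append({
                "region": region,
                "start": window_start,
                "end": min(window_start + step, end),
                "arm": rng.choice(["control", "treatment"]),
            })
            window_start += step
    return schedule


# Example: one region, two hours, 30-minute windows.
schedule = build_schedule(
    regions=["Manhattan"],
    start=datetime(2022, 1, 1, 12, 0),
    end=datetime(2022, 1, 1, 14, 0),
    window_minutes=30,
    seed=42,
)
for row in schedule:
    print(row["region"], row["start"].strftime("%H:%M"),
          row["end"].strftime("%H:%M"), row["arm"])
```

Fixing the seed makes the schedule auditable: the same inputs always reproduce the same assignment, which helps when the analysis later has to join outcomes back to windows.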

Switchback experiment design overview

Trade-offs and Challenges

While switchbacks mitigate user-to-user interference, they introduce completely different challenges.

1. Carryover Effects (Temporal Interference)

By switching back and forth, the system’s state in one time window bleeds into the next.

Power and sample size calculation

How many randomization units do we need? (Not how many users or orders.)

| Case | Method |
| :--- | :--- |
| The randomization unit (e.g., metro-day) is the same as the analysis unit | Power calculation based on two sample t test (similar to power calculation for A/B test) |
| The randomization unit (e.g., metro-day) is higher than the analysis unit (e.g., order) | Delta method for variance calculation (need to account for correlation across analysis units within the same randomization unit) |
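
For the first case, where the randomization unit equals the analysis unit, the required number of units can be approximated as below. This is a hedged sketch: it uses the normal approximation to the two-sample t-test, which gives a slightly smaller answer than a t-based calculation when the number of units is small. The function name `units_per_arm` and its parameters are hypothetical. `sigma` is assumed to be the standard deviation of the metric across randomization units, and `mde` is the minimum detectable effect in the same units.

```python
import math
from statistics import NormalDist


def units_per_arm(sigma, mde, alpha=0.05, power=0.8):
    """Randomization units needed per arm (two-sided test, equal arms).

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / mde^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / mde ** 2
    return math.ceil(n)


# Example with made-up inputs: unit-level std dev 1.0, detect a shift of 0.5.
print(units_per_arm(sigma=1.0, mde=0.5))
```

In the second case the same formula can be reused once the variance of the unit-level estimate has been obtained with the Delta method, since that variance already accounts for correlation among analysis units in the same randomization unit.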

Analyze a switchback experiment

Treatment effect

We look for:

The following is a typical result from a switchback experiment:

| order | zone | date | metro | metric | treatment |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 1 | 2022-01-01 | SF | 0.5 | 1 |
| 2 | 1 | 2022-01-01 | SF | 0.4 | 1 |
| 3 | 1 | 2022-01-02 | SF | 0.3 | 0 |
| 4 | 2 | 2022-01-01 | SF | 0.4 | 0 |
| 5 | 2 | 2022-01-01 | SF | 0.1 | 0 |
| 6 | 2 | 2022-01-02 | SF | 0.5 | 1 |
| 7 | 3 | 2022-01-01 | LA | 0.3 | 0 |
| 8 | 3 | 2022-01-02 | LA | 0.2 | 1 |
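
As an illustration, the rows in the table above can be aggregated to the randomization unit before comparing arms. This sketch assumes the randomization unit is the zone-day, which is inferred from the table (treatment is constant within each zone and date) rather than stated in the text. It also assumes every unit is weighted equally. An order-weighted estimate would give a different number.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the table: (order, zone, date, metro, metric, treatment)
orders = [
    (1, 1, "2022-01-01", "SF", 0.5, 1),
    (2, 1, "2022-01-01", "SF", 0.4, 1),
    (3, 1, "2022-01-02", "SF", 0.3, 0),
    (4, 2, "2022-01-01", "SF", 0.4, 0),
    (5, 2, "2022-01-01", "SF", 0.1, 0),
    (6, 2, "2022-01-02", "SF", 0.5, 1),
    (7, 3, "2022-01-01", "LA", 0.3, 0),
    (8, 3, "2022-01-02", "LA", 0.2, 1),
]

# Group order-level metrics by the assumed randomization unit (zone, date).
metrics_by_unit = defaultdict(list)
arm_of_unit = {}
for _order, zone, date, _metro, metric, treatment in orders:
    metrics_by_unit[(zone, date)].append(metric)
    arm_of_unit[(zone, date)] = treatment

# One value per unit, then a simple difference in means across units.
unit_means = {unit: mean(values) for unit, values in metrics_by_unit.items()}
treated = [m for unit, m in unit_means.items() if arm_of_unit[unit] == 1]
control = [m for unit, m in unit_means.items() if arm_of_unit[unit] == 0]
effect = mean(treated) - mean(control)

print(f"{len(treated)} treated units, {len(control)} control units, "
      f"effect = {effect:.3f}")
```

Aggregating first means the eight orders collapse to six units, which is the sample size that matters for inference.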

Inference

| Case | Method |
| :--- | :--- |
| The randomization unit (e.g., metro-day) is the same as the analysis unit | 1. Two sample t-test; 2. Permutation test; 3. More advanced modeling |
| The randomization unit (e.g., metro-day) is higher than the analysis unit (e.g., order) | 1. Two sample z-test with variance calculated from the Delta method; 2. Permutation test; 3. More advanced modeling |

Conclusion

User-level A/B testing produces biased estimates for interventions that affect shared physical supply, because treatment and control users compete for the same resources. Despite the challenges of carryover effects and lower statistical power, switchback experimentation remains the gold standard for measuring the true global impact of marketplace algorithms.

References