Offline Dynamic Pricing under Covariate Shift and Local Differential Privacy via Twofold Pessimism

We propose a direct-method (DM)-type transfer learning algorithm for learning a continuous treatment (pricing) policy from offline data under local differential privacy.

We study pricing policy learning from batched contextual bandit data under market shift and privacy protection. Market shift is modeled as covariate shift, where the feature distribution changes between the source and target markets while the relationship among treatments, features, and rewards remains invariant; privacy is enforced through local differential privacy (LDP), which perturbs each data point at the source before it is collected. Viewing the off-policy setting, covariate shift, and LDP collectively as forms of distributional shift, we develop a policy learning algorithm based on a unified pessimism principle that addresses all three shifts. In the non-private setting, we estimate the conditional reward via nonparametric regression and quantify its variance to construct a pessimistic estimator, yielding a policy with minimax-optimal decision error. Under LDP, we apply the Laplace mechanism and adjust the pessimistic estimator to account for the additional uncertainty from privacy noise. The resulting doubly pessimistic objective is then optimized to determine the final pricing policy.
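To make the twofold pessimism concrete, here is a minimal numerical sketch in Python. It is an illustration under simplifying assumptions, not the paper's estimator or calibration: the synthetic demand model, the Gaussian-kernel regression, the crude variance bound B²/4, and the penalty constant c are all placeholders. Only the Laplace scale B/ε and the added noise variance 2(B/ε)² follow standard Laplace-mechanism facts.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical offline pricing data: feature x, logged price p, revenue y in [0, B] ---
n, B = 2000, 1.0
x = rng.uniform(0, 1, n)
p = rng.uniform(0, 1, n)                          # prices logged by a behavior policy
y = p * (rng.uniform(0, 1, n) < 1 - 0.8 * p * x)  # revenue = price * 1{sale}; illustrative demand

# --- LDP step: Laplace mechanism perturbs each reward before it is used ---
eps = 2.0                                     # privacy budget (illustrative choice)
lap_scale = B / eps                           # sensitivity of y is its range B
y_priv = y + rng.laplace(0.0, lap_scale, n)   # privatized rewards; unbiased for E[y]

def pessimistic_value(x0, p0, h=0.1, c=1.0):
    """Nadaraya-Watson estimate of E[y | x, p] minus an uncertainty penalty.

    The penalty combines a crude sampling-variance bound (B^2/4 for rewards
    bounded in [0, B]) with the Laplace noise variance 2 * (B/eps)^2; the
    bandwidth h and constant c are placeholders, not calibrated choices.
    """
    w = np.exp(-((x - x0) ** 2 + (p - p0) ** 2) / (2 * h ** 2))  # kernel weights
    n_eff = w.sum() ** 2 / (w ** 2).sum()        # effective local sample size
    mu_hat = np.dot(w, y_priv) / w.sum()         # privatized regression estimate
    var_hat = B ** 2 / 4 + 2 * lap_scale ** 2    # variance bound + privacy noise
    return mu_hat - c * np.sqrt(var_hat / n_eff)  # doubly pessimistic value

def price(x0, grid=np.linspace(0.05, 1.0, 40)):
    """Pick the price maximizing the doubly pessimistic objective over a grid."""
    return grid[np.argmax([pessimistic_value(x0, g) for g in grid])]

print(price(0.3), price(0.9))
```

Note the design point this sketch is meant to convey: the privacy noise enters the penalty additively through its variance, so the pessimism automatically widens in regions where the privatized data are locally thin.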

This work was presented at the NeurIPS 2025 MLxOR workshop. You can also check out the full paper here.

Keywords: differential privacy, nonparametric statistics, reinforcement learning/bandits, causal inference

References

  1. Jongmin Mun, Xiaocong Xu, and Yingying Fan. "Offline Dynamic Pricing under Covariate Shift and Local Differential Privacy via Twofold Pessimism." NeurIPS 2025 Workshop MLxOR, November 2025.