Offline Dynamic Pricing under Covariate Shift and Local Differential Privacy via Twofold Pessimism

We propose a direct-method (DM)-type transfer learning algorithm for learning a continuous treatment (pricing) policy from offline data under local differential privacy.

We study pricing policy learning from batched contextual bandit data under market shift and privacy protection. Market shift is modeled as covariate shift, where the feature distribution changes between the source and target markets while the relationship among treatments, features, and rewards remains invariant; privacy is enforced through local differential privacy (LDP), which perturbs each data point at the source before it is collected. Viewing the off-policy setting, covariate shift, and LDP collectively as forms of distributional shift, we develop a policy learning algorithm based on a unified pessimism principle that addresses all three shifts. In the non-private setting, we estimate the conditional reward via nonparametric regression and quantify its variance to construct a pessimistic estimator, yielding a policy with minimax-optimal decision error. Under LDP, we apply the Laplace mechanism and adjust the pessimistic estimator to account for the additional uncertainty from privacy noise. The resulting doubly pessimistic objective is then optimized to determine the final pricing policy.
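To make the twofold pessimism concrete, here is a minimal numerical sketch in Python. It is an illustration under simplifying assumptions, not the paper's estimator or calibration: the synthetic demand model, the Gaussian-kernel regression, the crude variance bound B²/4, and the penalty constant c are all placeholders. Only the Laplace scale B/ε and the added noise variance 2(B/ε)² follow standard Laplace-mechanism facts.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical offline pricing data: feature x, logged price p, revenue y in [0, B] ---
n, B = 2000, 1.0
x = rng.uniform(0, 1, n)
p = rng.uniform(0, 1, n)                          # prices logged by a behavior policy
y = p * (rng.uniform(0, 1, n) < 1 - 0.8 * p * x)  # revenue = price * 1{sale}; illustrative demand

# --- LDP step: Laplace mechanism perturbs each reward before it is used ---
eps = 2.0                                     # privacy budget (illustrative choice)
lap_scale = B / eps                           # sensitivity of y is its range B
y_priv = y + rng.laplace(0.0, lap_scale, n)   # privatized rewards; unbiased for E[y]

def pessimistic_value(x0, p0, h=0.1, c=1.0):
    """Nadaraya-Watson estimate of E[y | x, p] minus an uncertainty penalty.

    The penalty combines a crude sampling-variance bound (B^2/4 for rewards
    bounded in [0, B]) with the Laplace noise variance 2 * (B/eps)^2; the
    bandwidth h and constant c are placeholders, not calibrated choices.
    """
    w = np.exp(-((x - x0) ** 2 + (p - p0) ** 2) / (2 * h ** 2))  # kernel weights
    n_eff = w.sum() ** 2 / (w ** 2).sum()        # effective local sample size
    mu_hat = np.dot(w, y_priv) / w.sum()         # privatized regression estimate
    var_hat = B ** 2 / 4 + 2 * lap_scale ** 2    # variance bound + privacy noise
    return mu_hat - c * np.sqrt(var_hat / n_eff)  # doubly pessimistic value

def price(x0, grid=np.linspace(0.05, 1.0, 40)):
    """Pick the price maximizing the doubly pessimistic objective over a grid."""
    return grid[np.argmax([pessimistic_value(x0, g) for g in grid])]

print(price(0.3), price(0.9))
```

Note the design point this sketch is meant to convey: the privacy noise enters the penalty additively through its variance, so the pessimism automatically widens in regions where the privatized data are locally thin.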

This work was presented at the NeurIPS 2025 MLxOR workshop. You can also check out the full paper here.

Keywords: differential privacy, nonparametric statistics, reinforcement learning/bandits, causal inference

References

  1. Jongmin Mun, Xiaocong Xu, and Yingying Fan. "Offline Dynamic Pricing under Covariate Shift and Local Differential Privacy via Twofold Pessimism." NeurIPS 2025 Workshop MLxOR, November 2025.