Private and Scalable Hash Kernel Bit-flip Two-Sample Testing

Mun et al. (2025) proposed a minimax optimal two-sample (A/B) testing algorithm under local differential privacy (LDP). However, a key challenge arises when dealing with high-dimensional datasets, such as images or text, where the domain size $k$ is massive or unknown.

The resulting multinomial distribution implies that the entry-wise perturbation method of Mun et al. (2025), which relies on a variant of Erlingsson et al. (2014), introduces excessive noise relative to the signal, significantly reducing testing power. Furthermore, conventional dimension reduction methods like PCA are not directly applicable in the LDP setting, as individual data owners cannot access the global covariance structure.

Since the test statistic relies on the Euclidean inner product—a simple linear kernel—we propose utilizing the hash kernel approximation from Shi et al. (2009) to enhance scalability. We present a modified LDP algorithm that projects high-dimensional data into a lower-dimensional sketch before noise injection. This approach enables efficient computation and handles unknown alphabet sizes while preserving the core geometric structure required for the two-sample test.

This project began as the final assignment for CSCI-699: Privacy-Preserving Machine Learning, offered at USC in Fall 2024. The core idea of using hashing was suggested by Sai Praneeth Karimireddy, the lecturer of the course and an assistant professor of computer science at USC. You can also check out the full paper here.

References

Mun, J., Kwak, M., & Kim, I. (2025). Minimax optimal two-sample testing under local differential privacy. Journal of Machine Learning Research.
Erlingsson, Ú., Pihur, V., & Korolova, A. (2014). RAPPOR: Randomized aggregatable privacy-preserving ordinal response. ACM CCS.
Shi, Q., Petterson, J., Drezde, G., Li, X., Smola, A., & Vishwanathan, S. (2009). Hash kernels for structured data. Journal of Machine Learning Research.