Modern Statistical Inference

Ph.D. level course on Modern Statistical Inference, focusing on high-dimensional settings and statistical learning theory.

Course Description

This is a Ph.D.-level lecture course exposing students to modern ideas in statistical theory. Traditional theory assumes access to many observations (a large sample size) while the number of variables (features) stays fixed; in the modern big-data era, however, we can collect fine-grained information on each individual, which allows us to fit feature-rich models. Our emphasis will be on statistical inference in such high-dimensional settings, where there may be as many variables as observations, or significantly more. We will cover ideas for gauging the reliability of statistical methods and the reproducibility of findings.

As an example, suppose you develop a machine learning system that makes personalized predictions, such as a risk score for a patient. Consider the following questions:

  • How certain should we be about the predictions made by your algorithm?
  • How certain should we be about the discovered associations between the variables and the response? Are they statistically significant?
  • How fair are your predictions with respect to minority subgroups in the data?

This class will serve as a rapid introduction to current topics in statistical learning with a focus on theory and methodology. The course will cover:

  • Testing problems in high dimensions: Bonferroni’s method, Fisher’s test, higher criticism
  • Multiple testing problems: family-wise error rate (FWER), procedures for controlling FWER, false discovery rate (FDR), procedures for controlling FDR, online control of FDR
  • Conformal prediction, conformalized quantile regression, and their applications
  • Conditional randomization test
  • Gaussian comparison inequalities: Slepian’s inequality, Gaussian interpolation, Gordon’s theorem
  • Applications of Gaussian inequalities for analyzing statistical behavior of M-estimators
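
To give a concrete taste of the multiple-testing material, the Bonferroni correction (which controls FWER) and the Benjamini–Hochberg step-up procedure (which controls FDR under independence) can each be sketched in a few lines. This is an illustration only, not course code:

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls FWER at level alpha."""
    m = len(pvals)
    return pvals <= alpha / m

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: find the largest k with p_(k) <= k * alpha / m and
    reject the k smallest p-values; controls FDR under independence."""
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:
        k = below.max()          # largest index passing the step-up rule
        reject[order[: k + 1]] = True
    return reject
```

On the same p-values, BH typically makes at least as many rejections as Bonferroni, which is the trade-off between FWER and FDR control discussed in the course.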

Learning Objectives

Upon successful completion of this course, students will be able to:

  • Speak comfortably about statistical significance, p-values, confidence intervals.
  • Evaluate statistical inference procedures in high-dimensional settings, where there may be as many variables as observations, or significantly more, in order to improve the reliability of statistical methods and the reproducibility of findings, including any discovered associations between the variables and the response.
  • Describe and analyze various statistical tests and their applications: Bonferroni’s global test, Fisher’s test, chi-square test, higher criticism.
  • Identify multiple testing problems: family-wise error rate (FWER), procedures for controlling FWER, false discovery rate (FDR), procedures for controlling FDR, online control of FDR.
  • Explain conformal prediction, conformalized quantile regression, and their applications.
  • Perform the conditional randomization test to properly account for confounding factors.
  • Prove Gaussian comparison inequalities: Slepian’s inequality, Gaussian interpolation, and Gordon’s theorem.
  • Apply Gaussian inequalities to derive a precise asymptotic characterization of the statistical behavior of M-estimators.
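
The conformal-prediction objective above can be illustrated with a minimal split-conformal sketch for regression. Here model_fit and model_predict are hypothetical placeholders for any regression routine; the guarantee is marginal 1 − α coverage for a new exchangeable data point:

```python
import numpy as np

def split_conformal_interval(model_fit, model_predict, X, y, X_test,
                             alpha=0.1, seed=None):
    """Split conformal prediction with absolute-residual scores.

    Fits on half the data, computes nonconformity scores on the other
    half, and returns intervals with marginal 1 - alpha coverage
    (assuming exchangeability)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    train, calib = idx[: n // 2], idx[n // 2:]
    params = model_fit(X[train], y[train])
    # nonconformity scores on the calibration fold
    scores = np.abs(y[calib] - model_predict(params, X[calib]))
    # the ceil((n_cal + 1)(1 - alpha))-th smallest score (clamped)
    k = int(np.ceil((len(calib) + 1) * (1 - alpha)))
    q = np.sort(scores)[min(k, len(calib)) - 1]
    mu = model_predict(params, X_test)
    return mu - q, mu + q
```

Note that the interval width q is a single calibrated residual quantile; conformalized quantile regression, covered later, replaces the absolute residual with scores built from estimated conditional quantiles to adapt the width to X.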

Required Materials

We will not follow a textbook; as this is a Ph.D. class, you are highly encouraged to consult outside sources to supplement your learning. The following references may be useful for background reading:

  1. B. Efron, Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, IMS Monographs, Cambridge University Press, 2010.
  2. E. L. Lehmann and J. P. Romano, Testing Statistical Hypotheses, 3rd edition, Springer Science & Business Media, 2005.
  3. R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, Cambridge University Press, 2018.

Project Updates