Notes by Rachael Phillips for PB HLTH 290, Spring 2019
An asymptotically linear estimator whose influence curve equals the efficient influence curve is optimal in the sense that no other asymptotically linear estimator has an influence curve with smaller variance. We call such an estimator asymptotically efficient.
Data and Model: \(O_1, \dots, O_n \overset{iid}{\sim} P_0 \in \mathcal{M}.\) Here, \(\mathcal{M}\) denotes the statistical model, which is the collection of all possible probability distributions \(P\) that could generate the data.
Target Parameter: The target parameter is defined as a functional (or operator) \(\Psi: \mathcal{M} \to \mathbb{R}\). This mapping takes a probability distribution \(P\) as input and returns a scalar value representing a specific feature of that distribution (e.g., the mean, the risk difference).
Estimand: The true value of the parameter, often denoted as \(\psi_0 = \Psi(P_0)\), is the estimand. This is an unknown quantity because the true data-generating distribution \(P_0\) is unknown.
The Problem: The “Straight Line” Fallacy. We want to measure the sensitivity (“steepness”) of the functional \(\Psi\) at the distribution \(P\). In standard calculus, we would simply take a derivative along a straight line (\(P + \epsilon Q\)). However, the space of probability distributions is curved, not flat (it is not a vector space). If we try to walk in a straight line off of \(P\), we immediately land in “invalid territory” (e.g., generating negative probabilities or measures that do not sum to one).
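The fallacy is easy to see numerically. Here is a minimal sketch (the probability vector and perturbation are hypothetical, not from the notes): a straight-line perturbation \(p + \epsilon h\) can preserve total mass and still fail to be a distribution.

```python
import numpy as np

# Hypothetical discrete density p and straight-line perturbation direction h.
p = np.array([0.05, 0.45, 0.5])
h = np.array([-1.0, 0.5, 0.5])   # sums to 0, so total mass is preserved...
eps = 0.1
q = p + eps * h                  # ...but "walking in a straight line" off of p
print(q)                         # first entry is negative: invalid territory
```

Even though `q` still sums to one, its first entry is negative, so `q` is not a probability vector.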
The Solution: The Curve-Drawing Machine. To stay within the valid distribution space, we approach \(P\) along smooth curves. We utilize a parametric submodel, constructed by a curve-drawing machine \(P_\epsilon^h\).
The Mechanism: The Chain Rule Analogy. We want to calculate how the parameter \(\Psi\) changes as we move along this curve. We can understand this via a Chain Rule Analogy. While the formal calculus of functionals is more complex, the intuition parallels standard calculus (\(\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}\)):
\[\underbrace{\frac{d}{d\epsilon} \Psi(P^h_\epsilon)}_{\text{Total Change}}\bigg|_{\epsilon=0} \approx \underbrace{\text{"Operator Change"}}_{\frac{d\Psi}{dP}} \cdot \underbrace{\text{"Curve Change"}}_{\frac{dP}{d\epsilon}}\]However, mathematically, the components are defined more precisely in the Hilbert space \(L_2(P)\).
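The chain-rule intuition can be checked on a toy example. Below is a minimal sketch (all numbers hypothetical, not from the notes) for the mean parameter \(\Psi(P)=\mathbb{E}_P[O]\) on a discrete \(P\): the "total change" along the curve \(dP_\epsilon = (1+\epsilon h)\,dP\), computed by finite differences, matches the inner product of the gradient \(D(P)(O)=O-\Psi(P)\) with the score \(h\).

```python
import numpy as np

# Hypothetical discrete setup: O takes values 0..4 with probabilities p.
o = np.arange(5)
p = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
h = np.array([1.0, -1.0, 0.5, -0.5, 0.0])
h = h - np.sum(h * p)                        # center so E_P[h] = 0 (a valid score)

psi = lambda q: np.sum(o * q)                # target parameter: the mean E_P[O]

# "Total change": finite-difference derivative along dP_eps = (1 + eps*h) dP
eps = 1e-6
total_change = (psi((1 + eps * h) * p) - psi(p)) / eps

# Gradient-times-score: <D(P), h>_P with D(P)(O) = O - Psi(P)
D = o - psi(p)
inner = np.sum(D * h * p)                    # E_P[D(O) h(O)]
print(total_change, inner)                   # the two quantities agree
```

The agreement is exact here because the mean parameter is linear in \(P\); for nonlinear parameters the match holds to first order in \(\epsilon\).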
Conclusion. The beauty of this approach is that we can separate the geometry of the model from the target parameter. We can pre-compute the “curve part” (the Score \(S_h\)) purely based on the submodel. When we combine it with the Gradient via the inner product, we recover the pathwise derivative we need to study efficiency.
Our primary objective is to estimate the unknown quantity \(\psi_0 = \Psi(P_0)\) and to understand the fundamental limits of estimation accuracy. The properties of the functional \(\Psi\) itself dictate the difficulty of the estimation problem.
Motivation: Valid Directions. When defining the derivative of a target parameter, we cannot simply look at arbitrary perturbations \(P + \epsilon h\) (as in standard calculus). The resulting object \(P + \epsilon h\) might not be a valid probability distribution (e.g., it might not integrate to 1 or could be negative). Therefore, we must restrict our attention to perturbations within the space of valid probabilities. We achieve this by defining parametric submodels.
For a specific path \(h\), we define a one-dimensional parametric submodel passing through the true distribution \(P\):
\[\mathcal{M}_h(P) = \{ P^h_{\epsilon} : \epsilon \in (-\delta, \delta) \} \subset \mathcal{M}\]This submodel (a collection of distributions) is a curve within the large model \(\mathcal{M}\) that passes through \(P\) at \(\epsilon = 0\), i.e., \(P^h_0 = P\).
The only direction we care about for each submodel, \(\mathcal{M}_h(P)\), is its score. For a path \(h\), the score \(S_h\) is defined as a transformation of an observation:
\[S_h(O)=\left.\frac{d}{d\epsilon}\log \frac{dP_{\epsilon}^h}{dP}(O)\right|_{\epsilon=0}\]Notice that the score is defined as usual: we take the log of the density, defined with respect to \(P\) itself. In other words, you choose the path so that all of its probability distributions are of the same nature as \(P\) itself, which lets you define \(\frac{dP_{\epsilon}^h}{dP}\). This gives a collection of densities \(p_{\epsilon}^h = \frac{dP_{\epsilon}^h}{dP}\), so that \[S_h(O)=\left.\frac{d}{d\epsilon}\log p_{\epsilon}^h(O)\right|_{\epsilon=0}.\]
We consider the class of all parametric submodels \(\{\mathcal{M}_h(P) : h \in \mathcal{H}\}\), indexed by a set of paths \(\mathcal{H}\). Let \(\mathcal{S} = \{ S_h : h \in \mathcal{H} \}\) be the collection of all score functions generated by these paths. The choice of \(\mathcal{H}\) matters, since it determines which scores \(\mathcal{S}\) contains.
Scores as Random Variables. Scores are measurable functions of the data \(O \sim P\). Therefore, they are random variables with specific properties: they have mean zero and finite variance under \(P\).
Hilbert Space. We define \(L^2_0(P)\) as the Hilbert space containing all such mean-zero, square-integrable functions of \(O\) (its elements are, in general, correlated with one another):
\[L^2_0(P) = \{ f(O) : \mathbb{E}_P[f(O)]=0, \, \mathbb{E}_P[f(O)^2] < \infty \}\]with inner product defined as the covariance (since they are centered):
\[\langle f, g \rangle_P = \mathbb{E}_P[ f(O)g(O) ] = \mathrm{Cov}_P(f,g)\]Scores belong to \(L_0^2(P)\).
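A minimal numerical sketch of this identity (the distribution and functions are hypothetical, not from the notes): for centered functions, the \(L^2_0(P)\) inner product and the covariance coincide.

```python
import numpy as np

# Hypothetical discrete P on the values 0..3.
vals = np.arange(4)
p = np.array([0.4, 0.3, 0.2, 0.1])
E = lambda f: np.sum(f * p)                  # expectation under P

f = vals.astype(float) - E(vals)             # center: E_P[f] = 0, so f is in L^2_0(P)
g = (vals ** 2).astype(float)
g = g - E(g)                                 # center g as well

inner = E(f * g)                             # <f, g>_P = E_P[f g]
cov = E(f * g) - E(f) * E(g)                 # Cov_P(f, g)
print(inner, cov)                            # identical for centered f and g
```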
Orthogonality. In this space, two functions are orthogonal (\(f \perp g\)) exactly when the corresponding random variables are uncorrelated. Since the limiting distributions we work with are Gaussian, orthogonality will later also imply independence.
Projection. Projection is the bread and butter of working in a Hilbert space.
Model. \(\mathcal{M}\) is nonparametric. Here we define it as the collection of all probability distributions that admit densities.
Direction. A bounded, mean-zero function \(h(o)\).
Submodel. We define \(P_\epsilon^h\) via its density: \(dP_{\epsilon}^h(o)=(1+\epsilon h(o))\, dP(o)\).
Intuition: To add probability mass to one area (where \(h > 0\)), we must steal it from another area (where \(h < 0\)) to keep the total mass constant.
Therefore, if we restrict \(\epsilon\) to \(\epsilon \in (-\delta, \delta)\) with \(\delta = 1/\|h\|_\infty\), then \(1 + \epsilon h > 0\) everywhere and \(\{P^h_\epsilon\}\) is a valid submodel \(\mathcal{M}_h(P)\).
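The role of the bound \(\delta = 1/\|h\|_\infty\) can be checked numerically; a minimal sketch (hypothetical \(P\) and \(h\), not from the notes):

```python
import numpy as np

# Hypothetical discrete P and bounded, mean-zero direction h.
p = np.array([0.25, 0.25, 0.25, 0.25])
h = np.array([2.0, -2.0, 1.0, -1.0])         # E_P[h] = 0, ||h||_inf = 2
delta = 1.0 / np.max(np.abs(h))              # = 0.5

eps = 0.9 * delta
q = (1 + eps * h) * p                        # inside the bound: 1 + eps*h > 0
print(q.min() > 0, abs(q.sum() - 1) < 1e-12) # a genuine probability distribution

eps_bad = 1.1 * delta
q_bad = (1 + eps_bad * h) * p
print(q_bad.min())                           # past delta: negative mass appears
```

Mass is conserved for every \(\epsilon\) (because \(\mathbb{E}_P[h]=0\)); it is nonnegativity that fails once \(|\epsilon|\) exceeds \(\delta\).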
Score. This construction yields the score \(h\) exactly. By the construction \(dP^h_{\epsilon} = (1+\epsilon h)\, dP\):
\[S(O) = \frac{d}{d\epsilon} \log \big( \frac{(1+\epsilon h(O)) dP(O)}{dP(O)} \big) \bigg|_{\epsilon=0}\]The derivative of \(\log(u)\) is \(u'/u\):
\[S(O) = \frac{h(O)}{1+\epsilon h(O)} \bigg|_{\epsilon=0}\]Set \(\epsilon=0\):
\[S(O) = \frac{h(O)}{1} = h(O)\]Scores. \(\mathcal{S}\) is all \(h\in L^2_0(P)\) with \(\|h\|_{\infty}<\infty\).
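The derivation above can be verified by finite differences; a minimal sketch (hypothetical discrete \(P\) and \(h\), not from the notes):

```python
import numpy as np

# Numerical check that the score of the path dP_eps = (1 + eps*h) dP is h itself.
p = np.array([0.3, 0.3, 0.2, 0.2])
h = np.array([1.0, -1.0, 0.5, -0.5])
h = h - np.sum(h * p)                        # ensure E_P[h] = 0

eps = 1e-6
# density ratio dP_eps/dP = 1 + eps*h; differentiate its log at eps = 0
score = np.log(1 + eps * h) / eps            # (log(1 + eps*h) - log(1)) / eps
print(np.max(np.abs(score - h)))             # ~0: the score matches h
```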
We have that each score is an element of the Hilbert space, and the collection of scores corresponding to this class of paths generates a sub-Hilbert space of \(L_0^2(P)\). We take all linear combinations of the scores together with their closure (any function you can approximate as a limit of such linear combinations is also included), and that creates a sub-Hilbert space \(H\) of \(L_0^2(P)\). \(H\) is the tangent space corresponding to this class of paths.
Example. For the target parameter \(\Psi(P) = P(T > 5)\), the canonical gradient is
\[D(P)(T)=I(T>5)-\Psi(P).\]Here the tangent space is \(T(P)=L_0^2(P)\), so the orthogonal complement of the tangent space is \(\{0\}\), meaning you cannot add anything to the canonical gradient to create more gradients.
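This gradient can be checked numerically; a minimal sketch (the grid for \(T\) and the path \(h\) are hypothetical, not from the notes): the pathwise derivative of \(\Psi(P)=P(T>5)\) along a submodel equals the inner product of \(D(P)\) with the score.

```python
import numpy as np

# Hypothetical discrete T supported on 1..10, uniform under P.
t = np.arange(1, 11)
p = np.full(10, 0.1)
indic = (t > 5).astype(float)

psi = lambda q: np.sum(indic * q)            # Psi(P) = P(T > 5)

h = np.where(t <= 5, 1.0, -1.0)
h = h - np.sum(h * p)                        # score: bounded, E_P[h] = 0

eps = 1e-6
deriv = (psi((1 + eps * h) * p) - psi(p)) / eps  # pathwise derivative at eps = 0

D = indic - psi(p)                           # canonical gradient D(P)(T) = I(T>5) - Psi(P)
inner = np.sum(D * h * p)                    # <D(P), h>_P
print(deriv, inner)                          # equal: D(P) is a gradient
```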