Causal inference of sex differences in autism spectrum disorder

This project addresses the statistical challenge of estimating heterogeneous treatment effects (HTEs) in Autism Spectrum Disorder (ASD) from functional connectivity data in which female participants are substantially underrepresented. By employing a cluster-aware generative model for oversampling, we adjust for this class imbalance and improve the stability of estimates for the female subgroup.

Our findings reveal significant sex-specific functional connectivity patterns associated with higher-order cognitive control functions, highlighting the importance of proper statistical handling of imbalanced biomedical data.

1. Motivation: The Causal Estimand

The primary objective of this study was to estimate the Heterogeneous Treatment Effect (HTE) of Autism Spectrum Disorder (ASD) conditional on biological sex.

Let $T$ denote the treatment (diagnosis: ASD=1, Control=0), $X$ denote the covariate (Sex: Male/Female), and $Y$ denote the outcome (Functional Gradient Score, a low-dimensional embedding of brain connectivity). Even after dimensionality reduction, there are several regions of interest, so we can consider $Y$ as a high-dimensional outcome.

The target estimand is the Interaction Effect, defined as:

\[\tau_{interaction} = \underbrace{(E[Y|T=1, X=F] - E[Y|T=0, X=F])}_{\text{Causal Effect in Females}} - \underbrace{(E[Y|T=1, X=M] - E[Y|T=0, X=M])}_{\text{Causal Effect in Males}}\]
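As a minimal sketch of the estimand above, the plug-in estimate replaces each conditional expectation with a stratum mean. The function name `interaction_effect` and the toy records are hypothetical and only illustrate the arithmetic for a single region of interest; they are not the study's data or code.

```python
from statistics import mean


def interaction_effect(records):
    """Plug-in estimate of tau_interaction from (t, x, y) records.

    t: 1 for ASD, 0 for control; x: "F" or "M"; y: scalar outcome.
    """
    def stratum_mean(t, x):
        # Sample analogue of E[Y | T=t, X=x].
        return mean(y for (ti, xi, y) in records if ti == t and xi == x)

    effect_f = stratum_mean(1, "F") - stratum_mean(0, "F")
    effect_m = stratum_mean(1, "M") - stratum_mean(0, "M")
    return effect_f - effect_m


# Hypothetical toy data: (T, X, Y) for one region of interest.
toy = [
    (1, "F", 0.2), (1, "F", 0.4),
    (0, "F", 0.9), (0, "F", 1.1),
    (1, "M", 0.8), (1, "M", 1.0),
    (0, "M", 1.0), (0, "M", 1.2),
]
tau = interaction_effect(toy)  # approximately -0.7 - (-0.2) = -0.5
```

With only a handful of observations in the Female/ASD stratum, the first of the four means is the noisiest term, which is the problem the next section describes.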

We hypothesized that the treatment effect of ASD on functional brain organization differs significantly between sexes.

2. The Identification Challenge: Practical Positivity Violation

While the theoretical assumption of Positivity ($0 < P(T=1|X) < 1$) holds, the study faces a severe class imbalance problem common in observational ASD research.

The Imbalance: ASD is diagnosed substantially more often in males, with male-to-female ratios ranging from 4:1 to 9:1.
The Consequence: The Female/ASD stratum is extremely sparse, representing approximately 12% of the total sample in this dataset.

In a finite sample regime, this sparsity compromises the estimation of the conditional expectation $E[Y|T=1, X=F]$. Standard non-parametric estimators like Inverse Probability Weighting (IPW) might assign extreme weights to the few observed female cases, which may themselves contain outliers. This would inflate the variance of the estimator, rendering confidence intervals too wide to detect significant interaction effects.
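To make the weighting problem concrete, the sketch below computes inverse-frequency weights that give each (T, X) stratum equal total weight, then summarizes them with the Kish effective sample size, $(\sum_i w_i)^2 / \sum_i w_i^2$. This is a simplified stand-in for IPW, not the study's estimator. The stratum counts are hypothetical: only the roughly 12% Female/ASD share echoes the text, and the helper names are illustrative.

```python
def balancing_weights(counts):
    """Inverse-frequency weights giving every stratum the same total weight."""
    n_total = sum(counts.values())
    k = len(counts)
    return {s: n_total / (k * n) for s, n in counts.items()}


def kish_ess(counts, weights):
    """Kish effective sample size: (sum w)^2 / sum w^2 over all units."""
    sum_w = sum(counts[s] * weights[s] for s in counts)
    sum_w2 = sum(counts[s] * weights[s] ** 2 for s in counts)
    return sum_w ** 2 / sum_w2


# Hypothetical stratum counts keyed by (T, X); only the ~12% Female/ASD
# share follows the text, the remaining counts are made up.
counts = {(1, "F"): 12, (0, "F"): 18, (1, "M"): 35, (0, "M"): 35}
weights = balancing_weights(counts)
ess = kish_ess(counts, weights)

heaviest = max(weights, key=weights.get)  # the sparsest stratum
```

In this toy setting each Female/ASD observation carries about three times the weight of a male observation, and the effective sample size drops below the nominal 100. A single outlier in the sparse stratum therefore has a disproportionate influence on the estimate.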

3. Methodological Contribution: Parametric Variance Reduction

To address the bias-variance trade-off inherent in the sparse stratum, the authors proposed a Gaussian Mixture Model (GMM)-based Oversampling technique[cite: 24, 64].

A. The Estimator

Instead of re-weighting existing data points (as in IPW), the authors constructed a synthetic pseudo-population to balance the strata ($N_{female} \rightarrow N_{male}$).

Distribution Estimation: The probability density function of the sparse class, $f(Y|X=F)$, was estimated using a parametric GMM[cite: 110, 111]: $f(y) = \sum_{j=1}^{J} \pi_j \psi(y | \mu_j, \sigma_j^2)$, where $\pi_j$ are the mixing weights, $\psi$ denotes the Gaussian density of each component, and the parameters are optimized via the Expectation-Maximization (EM) algorithm.
Generative Oversampling: Synthetic observations $\hat{Y}$ were drawn from the estimated density $f(y)$ until the sample sizes were balanced.
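The two steps above can be sketched for a single scalar outcome as follows. This is a simplified illustration, assuming a one-dimensional gradient score, a fixed number of components, and no adjustment for Age or Site; the function names, initialization scheme, and toy data are hypothetical and not taken from the study. In practice a library implementation such as scikit-learn's `GaussianMixture` would typically be used for multivariate outcomes.

```python
import math
import random


def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, n_components=2, n_iter=100, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    n = len(data)
    ordered = sorted(data)
    # Initialise means at evenly spaced quantiles, with a shared variance.
    means = [ordered[int((j + 0.5) * n / n_components)] for j in range(n_components)]
    grand_mean = sum(data) / n
    var0 = max(sum((y - grand_mean) ** 2 for y in data) / n, min_var)
    variances = [var0] * n_components
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        resp = []
        for y in data:
            dens = [w * normal_pdf(y, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update pi_j, mu_j, sigma_j^2 from the responsibilities.
        for j in range(n_components):
            r_j = sum(r[j] for r in resp)
            if r_j < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = r_j / n
            means[j] = sum(r[j] * y for r, y in zip(resp, data)) / r_j
            var_j = sum(r[j] * (y - means[j]) ** 2 for r, y in zip(resp, data)) / r_j
            variances[j] = max(var_j, min_var)
    return weights, means, variances


def oversample(data, target_size, rng, **gmm_kwargs):
    """Append synthetic draws from the fitted mixture until target_size is reached."""
    weights, means, variances = fit_gmm_1d(data, **gmm_kwargs)
    synthetic = []
    for _ in range(max(target_size - len(data), 0)):
        j = rng.choices(range(len(weights)), weights=weights)[0]  # pick a component
        synthetic.append(rng.gauss(means[j], math.sqrt(variances[j])))
    return list(data) + synthetic


# Hypothetical female scores drawn from two clusters, oversampled to N = 120.
rng = random.Random(0)
female = [rng.gauss(-1.0, 0.3) for _ in range(15)] + [rng.gauss(1.0, 0.3) for _ in range(15)]
balanced = oversample(female, target_size=120, rng=rng)
```

The observed values are kept as they are and only the shortfall is filled with synthetic draws, so the real female data are never discarded or altered.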

B. Statistical Justification (vs. Undersampling/IPW)

From a causal inference perspective, this method acts as a variance stabilization technique:

Undersampling: Standard approaches often undersample the majority class (males), which introduces additional randomness and discards information.
GMM Oversampling: By injecting a smoothness assumption (that brain gradients follow a mixture of Gaussians), the method effectively “fills in the gaps” of the sparse female data space.
Asymptotic Validity: The authors argue that because GMMs are universal approximators for smooth densities, the expanded dataset converges to the true distribution in the asymptotic regime, allowing for valid Type I error control despite the added randomness of synthesis.

4. Assessing Validity: Parametric Assumptions

To claim the estimand is identifiable and the parametric assumption valid, the study employed the following validation steps:

Consistency & Exchangeability: The study implicitly assumes conditional exchangeability by controlling for Age and Site during the oversampling process.
Parametric Fit Validation (Wasserstein Distance): To justify replacing the empirical distribution with the GMM approximation, the authors calculated the Wasserstein Distance between the histograms of the actual female gradients and the synthetic gradients.
- Result: The distance was consistently low, providing empirical evidence that the functional gradients are well-approximated by the Gaussian mixture, thus minimizing the bias introduced by the parametric assumption.
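For intuition, the 1-Wasserstein distance between two one-dimensional samples equals the area between their empirical CDFs. The sketch below computes it directly from raw samples rather than from histograms, which is an assumption about how the check could be implemented rather than the study's exact procedure; the function name is illustrative. SciPy's `scipy.stats.wasserstein_distance` computes the same quantity.

```python
def wasserstein_1d(a, b):
    """1-Wasserstein distance between two 1-D empirical distributions."""
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a) | set(b))
    dist, i, j = 0.0, 0, 0
    for left, right in zip(grid, grid[1:]):
        # Advance both empirical CDFs up to and including `left`.
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        # Area between the two CDFs on the interval [left, right).
        dist += abs(i / len(a) - j / len(b)) * (right - left)
    return dist


# Shifting a sample by 1 gives a distance of 1; identical samples give 0.
shifted = wasserstein_1d([0.0, 1.0], [1.0, 2.0])
same = wasserstein_1d([0.0, 1.0], [0.0, 1.0])
```

A small distance between the real and synthetic female gradients indicates that the fitted mixture reproduces the observed distribution, but it does not by itself validate the causal assumptions.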

5. Results: Uncovering Hidden Heterogeneity

By stabilizing the estimator variance, the study detected significant Heterogeneous Treatment Effects that were previously obscured by the noise of the imbalanced data.

Finding: Significant interaction effects were identified in the Somatomotor ($P_{FDR}=0.029$), Dorsal Attention ($P_{FDR}=0.029$), and Default Mode Networks (DMN) ($P_{FDR}<0.001$).
Direction of Effect: Female participants with ASD showed large shifts in gradient values (reduced values in DMN), whereas male participants showed smaller or opposite shifts.
Conclusion: The oversampling approach increased statistical power, revealing that female susceptibility to ASD involves distinct alterations in higher-order cognitive control networks.
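The $P_{FDR}$ values above are p-values adjusted for testing multiple networks. The text does not say which correction was used; the sketch below assumes the Benjamini-Hochberg procedure, a common choice, and the input p-values are hypothetical rather than the study's.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four networks.
raw = [0.01, 0.04, 0.03, 0.5]
p_fdr = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in p_fdr]
```

Here only the first network remains significant at the 0.05 level after adjustment, showing how the correction raises the bar as more networks are tested.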

Functional connectivity refers to the correlations between neural signal time series across different brain regions, in contrast to the physical (structural) connections between them. We are interested in understanding how autism causally affects functional connectivity, and whether this effect differs between sexes. From a statistical perspective, this problem corresponds to estimating heterogeneous treatment effects (HTEs). Because autism spectrum disorder (ASD) is diagnosed more frequently in males than in females, the number of female participants in the autism group is substantially smaller. As a result, female samples contribute less to the HTE estimation algorithm, leading to potentially unstable estimates for females. To address this subgroup imbalance, we employ oversampling using a cluster-aware generative model. The augmented data increase the representation of female samples in the HTE estimation process, thereby improving the stability and reliability of HTE estimates for females. Our results reveal that several brain regions exhibit significant HTEs that were not detectable prior to oversampling. Furthermore, we demonstrate that the identified HTEs are not random noise but have clear biological meaning: they are strongly associated with higher-order cognitive control functions, rather than lower-level sensory processing, and are linked to autism-related genes.
