Principles and practice of nonlinear dimensionality reduction methods popularly used in neuroscience.
Input: Data in the ambient dimension (of any type for which a kernel can be defined).
Output: $k$-dimensional Euclidean vectors representing the data on the underlying manifold.
For Euclidean data, the entries of the distance matrix are defined as: \(D(x_i, x_j) = \|x_i - x_j\|\). However, the data can be of any type, provided a suitable kernel can be defined to measure similarity. In the distance matrix below, we observe a banded pattern extending across the entire matrix, indicating global structure.
The pairwise distances are then passed through a nonlinear kernel function to calculate affinities. The nonlinearity effectively preserves local geometric information by giving higher weight to nearby points. For example, using a Gaussian kernel: \(A(x_i, x_j) = \exp\left(-\frac{D(x_i, x_j)^2}{\sigma}\right)\) In this context, $\sigma$ (or $\epsilon$) acts as a scale parameter that determines the size of the local neighborhood. Many kernels other than the Gaussian can be used.
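As a sketch, the two steps above (pairwise Euclidean distances, then a Gaussian kernel) take only a few lines of NumPy. The swiss-roll-like data and the value of `sigma` here are illustrative assumptions:

```python
import numpy as np

# Hypothetical swiss-roll-like data: n points in ambient dimension p.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3 * np.pi, 200))
X = np.column_stack([t * np.cos(t), t * np.sin(t)])  # n x p, here p = 2

# Pairwise Euclidean distance matrix D(x_i, x_j) = ||x_i - x_j||.
diffs = X[:, None, :] - X[None, :, :]
D = np.sqrt((diffs ** 2).sum(axis=-1))

# Gaussian kernel affinities; sigma sets the local neighborhood scale.
sigma = 2.0  # assumed value; tune for your data
A = np.exp(-D ** 2 / sigma)
```

Note that `A` is symmetric with ones on the diagonal, and distant pairs receive affinities near zero, which is exactly how the kernel emphasizes local geometry.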
The affinity matrix below shows that the global structure is partially removed (points are arranged according to the underlying swiss roll structure). The graph shows that some unnecessary edges still remain.
<div class="fake-img l-body">
</div>
The affinity matrix is then row-normalized to create the stochastic transition matrix $P = D^{-1} A$, i.e., the affinity matrix pre-multiplied by the inverse of the diagonal degree matrix $D$ (here $D_{ii} = \sum_j A_{ij}$, not to be confused with the distance matrix above). This matrix defines the probability of a “random walk” jumping from one data point to another in a single step.
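A minimal sketch of the row normalization, assuming a Gaussian-kernel affinity matrix as above (random data stands in for the real input):

```python
import numpy as np

# Build a toy affinity matrix from random data (illustrative stand-in).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
A = np.exp(-dist ** 2 / 1.0)  # sigma = 1.0, an assumed value

# Degree matrix: row sums of A on the diagonal. Dividing each row of A
# by its degree is exactly P = D^{-1} A.
deg = A.sum(axis=1)
P = A / deg[:, None]

# Each row of P is now a probability distribution over one-step jumps.
```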
This transition matrix $P$ acts as a Diffusion Operator with several key properties:
Why is $P$ called a diffusion operator? Because it diffuses heat! Since $P$ is an $n \times n$ matrix, it is a linear operator mapping a vector $f \in \mathbb{R}^n$ onto another vector in $\mathbb{R}^n$.
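To make the heat-diffusion picture concrete, here is an illustration on a hypothetical 4-point affinity matrix: applying $P$ replaces each entry of $f$ with a weighted average of its neighbors' values, so repeated application spreads an initial spike of "heat" across the graph.

```python
import numpy as np

# Toy affinity matrix on 4 points (any symmetric nonnegative matrix works).
A = np.array([[1.0, 0.5, 0.1, 0.0],
              [0.5, 1.0, 0.5, 0.1],
              [0.1, 0.5, 1.0, 0.5],
              [0.0, 0.1, 0.5, 1.0]])
P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix

# "Heat" concentrated entirely on the first point.
f = np.array([1.0, 0.0, 0.0, 0.0])

# One application of P smooths f toward its neighbors; five applications
# spread the heat to every point in the connected graph.
f1 = P @ f
f5 = np.linalg.matrix_power(P, 5) @ f
```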
At this stage, diffusion map embedding also performs an eigendecomposition, but on an $n \times n$ matrix, not a $p \times p$ matrix. We compare the process with PCA.
PCA operates on the feature space. We first center the data and compute the covariance matrix $\Sigma = \frac{1}{n-1}X^TX$, which is a $p \times p$ matrix. We then perform the eigendecomposition: \(\Sigma = U \Lambda U^T\) We retain the first $k$ components to form $U_k$. The low-dimensional embedding is then calculated via projection (inner product): \(X_{low} = XU_k\) By design, this embedding preserves the covariance structure.
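The PCA pipeline just described can be sketched as follows (random data for illustration; note the data must be centered before computing the covariance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X = X - X.mean(axis=0)  # PCA assumes centered data

# Covariance matrix (p x p) and its eigendecomposition.
Sigma = X.T @ X / (X.shape[0] - 1)
eigvals, U = np.linalg.eigh(Sigma)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]    # reorder to descending variance
U = U[:, order]

# Keep the first k components and project.
k = 2
X_low = X @ U[:, :k]  # n x k embedding
```

The projection step `X @ U[:, :k]` is the inner product mentioned above: each sample is expressed in the coordinates of the top $k$ eigenvectors of $\Sigma$.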
In contrast, Diffusion Maps operate on the sample space. Here, we use the stochastic transition matrix $P$, which is $n \times n$ (where $n$ is the number of observations). Now each observation $x_i$ need not be $p$-dimensional Euclidean. It can be of any data type for which a kernel can be defined.
Each eigenvector of $P$ is therefore an $n$-vector. For the embedding, we do not project; instead, we look up and scale: point $i$'s coordinates are its $i$-th entries in the retained eigenvectors, each scaled by the corresponding eigenvalue.
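A sketch of the lookup-and-scale embedding, assuming a Gaussian-kernel transition matrix built as above; the number of coordinates `k` and diffusion steps `t` are illustrative choices:

```python
import numpy as np

# Build a transition matrix P from toy data (illustrative stand-in).
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
A = np.exp(-dist ** 2 / 1.0)
P = A / A.sum(axis=1, keepdims=True)

# Eigendecomposition of the n x n operator P (eig, since P is not
# symmetric; its eigenvalues are real because P is similar to a
# symmetric matrix).
eigvals, eigvecs = np.linalg.eig(P)
order = np.argsort(-eigvals.real)
eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]

# The top eigenvector is trivial (constant, eigenvalue 1), so skip it.
# Embed point i by "lookup and scale": its i-th entry in each retained
# eigenvector, scaled by the corresponding eigenvalue (to the power t).
k, t = 2, 1  # assumed values
Psi = (eigvals[1:k + 1] ** t) * eigvecs[:, 1:k + 1]  # n x k embedding
```

Row $i$ of `Psi` is the $k$-dimensional diffusion coordinate of observation $i$; no inner product with the data is ever taken, which is why non-Euclidean inputs pose no problem.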