k-means Clustering

Formulating and implementing k-means clustering as a mixed-integer optimization problem.

Problem Definition

Suppose that we have $n$ data points $\{\mathbf{x}_i\}_{i=1}^n \subseteq \mathbb{R}^d$ in Euclidean space. We aim to partition these $n$ points into exactly $k$ clusters $S_1, \dots, S_k$ that minimize the sum of squared errors (SSE), defined by:

\[\text{SSE}(S_1, \dots, S_k) := \sum_{l=1}^k \sum_{i \in S_l} \|\mathbf{x}_i - \mu_l\|^2,\]
where $\mu_l := \frac{1}{|S_l|} \sum_{i \in S_l} \mathbf{x}_i$ is the centroid of cluster $l$. The $k$-means clustering problem can be formulated as:
\[\begin{align} \min_{\mu_1, \dots, \mu_k, S_1, \dots, S_k} & \sum_{l=1}^k \sum_{i \in S_l} \|\mathbf{x}_i - \mu_l\|^2 \\ \text{s.t.} & \quad S_1, \dots, S_k \text{ form a partition of } \{1, \dots, n\}, \\ & \quad \mu_1, \dots, \mu_k \in \mathbb{R}^d. \end{align}\]
Note that any optimal solution $\mu_1^*, \dots, \mu_k^*, S_1^*, \dots, S_k^*$ satisfies the centroid condition $\mu_l^* = \frac{1}{|S_l^*|} \sum_{i \in S_l^*} \mathbf{x}_i$; therefore, we do not need to explicitly impose this as a constraint.
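As a quick sanity check of these definitions, the SSE under the centroid condition can be computed directly with NumPy (a minimal sketch; the function name and toy data are illustrative, not part of the formulation):

```python
import numpy as np

def sse(X, labels, k):
    """Sum of squared errors for a given partition.

    X: (n, d) array of points; labels: length-n array with values in {0, ..., k-1}.
    Centroids are set by the centroid condition (mean of each cluster).
    """
    total = 0.0
    for l in range(k):
        cluster = X[labels == l]
        if len(cluster) == 0:
            continue
        mu = cluster.mean(axis=0)  # centroid condition
        total += ((cluster - mu) ** 2).sum()
    return total

# Toy example: two well-separated pairs of points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
print(sse(X, np.array([0, 0, 1, 1]), k=2))  # each cluster contributes 0.5, total 1.0
```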

Mixed-Integer Formulation ($k=2$)

In this problem, we will formulate and implement the $k$-means problem step-by-step for the case where $k = 2$.

Variables

We define the following variables to linearize the cluster assignments:

- $z_i \in \{0, 1\}$ for each point $i$: a binary assignment variable, with $z_i = 1$ if point $i$ is assigned to cluster 1 and $z_i = 0$ if it is assigned to cluster 2;
- $\mu_1, \mu_2 \in \mathbb{R}^d$: the cluster centroids;
- $\mathbf{w}_i^l \in \mathbb{R}^d$ for each point $i$ and cluster $l$: "eraser" vectors that cancel the error of point $i$ with respect to the cluster it is *not* assigned to;
- $\mathbf{r}_i^l \in \mathbb{R}^d$ for each point $i$ and cluster $l$: error vectors, which are nonzero only for the assigned cluster.

Objective

Because of the definition of $\mathbf{r}_i^l$, the objective function is defined as $\sum_{l=1}^2 \sum_{i=1}^n \|\mathbf{r}_i^l\|^2$.

Constraints

Big-M Formulation for Eraser Vectors

Eraser vectors are constrained using a large constant $M > 0$:

\[- M (1 - z_i) \mathbf{1} \le \mathbf{w}_i^1 \le M (1 - z_i) \mathbf{1}, \quad \forall i\] \[- M z_i \mathbf{1} \le \mathbf{w}_i^2 \le M z_i \mathbf{1}, \quad \forall i\]

Interpretation:

- If $z_i = 1$ (point $i$ is assigned to cluster 1), then $\mathbf{w}_i^1$ is forced to $\mathbf{0}$, so the error with respect to cluster 1 is counted; meanwhile $\mathbf{w}_i^2$ is free within $[-M, M]^d$ and can cancel the error with respect to cluster 2.
- If $z_i = 0$, the roles are reversed: $\mathbf{w}_i^2 = \mathbf{0}$ and $\mathbf{w}_i^1$ is free.
- $M$ must be chosen large enough that $\mathbf{w}_i^l = \mathbf{x}_i - \mu_l$ is always feasible for the unassigned cluster.

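These bounds can be illustrated numerically: for a point assigned to cluster 1 ($z_i = 1$), the eraser $\mathbf{w}_i^1$ must be zero, while $\mathbf{w}_i^2$ has enough room to equal $\mathbf{x}_i - \mu_2$ as long as $M$ is large enough (a minimal sketch; the point and centroid values are made up):

```python
import numpy as np

def eraser_feasible(w, z, cluster, M):
    """Check the big-M bounds on an eraser vector w for the given cluster.

    z is the binary assignment (1 = cluster 1); the componentwise bound is
    M*(1 - z) for cluster 1 and M*z for cluster 2.
    """
    bound = M * (1 - z) if cluster == 1 else M * z
    return bool(np.all(np.abs(w) <= bound))

M = 100.0
x_i = np.array([60.0, 20.0])
mu2 = np.array([10.0, 80.0])

# Point assigned to cluster 1: w^1 must be 0, w^2 may erase x_i - mu_2.
assert eraser_feasible(np.zeros(2), z=1, cluster=1, M=M)
assert not eraser_feasible(np.array([1.0, 0.0]), z=1, cluster=1, M=M)
assert eraser_feasible(x_i - mu2, z=1, cluster=2, M=M)
```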
Relationship Between Error and Eraser Vectors

The error vectors are defined as:

\[\mathbf{r}_i^l = \mathbf{x}_i - \mu_l - \mathbf{w}_i^l, \quad \forall i, l\]

Interpretation:

If $\mathbf{w}_i^l$ can be nonzero, the optimizer will set $\mathbf{w}_i^l = \mathbf{x}_i - \mu_l$ when point $i$ is not assigned to cluster $l$, driving the corresponding error $\mathbf{r}_i^l$ to $\mathbf{0}$.

This ensures that errors are counted only for points actually assigned to a given cluster.
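The cancellation can be verified directly: with the erasers set as above, $\sum_l \|\mathbf{r}_i^l\|^2$ collapses to the squared distance from each point to its own centroid (a sketch with illustrative data and fixed centroids):

```python
import numpy as np

# Two toy points, two fixed centroids, assignments z = 1 (cluster 1) and z = 0.
X = np.array([[1.0, 2.0], [9.0, 8.0]])
mu = np.array([[0.0, 0.0], [10.0, 10.0]])  # mu[0] = cluster 1, mu[1] = cluster 2
z = np.array([1, 0])

total = 0.0
for i in range(2):
    for l in range(2):
        assigned = (z[i] == 1 and l == 0) or (z[i] == 0 and l == 1)
        # Eraser: zero for the assigned cluster, x_i - mu_l otherwise.
        w = np.zeros(2) if assigned else X[i] - mu[l]
        r = X[i] - mu[l] - w  # r_i^l = x_i - mu_l - w_i^l
        total += (r ** 2).sum()

# Only assigned-cluster errors survive: ||x_0 - mu_1||^2 + ||x_1 - mu_2||^2.
expected = ((X[0] - mu[0]) ** 2).sum() + ((X[1] - mu[1]) ** 2).sum()
print(total, expected)  # both equal 10.0
```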

Implementation

We ran the k-means model using Gurobi with a 5-minute time limit.

To set up the problem, we carefully chose the Big-M value based on the spatial distribution of the Ruspini dataset and set $M = 1 \times 10^2$. The Big-M constant is used to enforce the eraser-vector logic in the MIQCP formulation.
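The actual run used Gurobi, but the $k = 2$ formulation can also be cross-checked by brute force on tiny data: enumerate the binary assignments $z$ and apply the centroid condition for each one (a sketch; the data below is illustrative, not the Ruspini set):

```python
import itertools
import numpy as np

def solve_kmeans2_bruteforce(X):
    """Enumerate all assignments z in {0,1}^n and return the best (SSE, z).

    For each fixed z the optimal centroids are the cluster means (centroid
    condition), so the inner minimization is solved in closed form.
    """
    n = len(X)
    best = (np.inf, None)
    for z in itertools.product([0, 1], repeat=n):
        z = np.array(z)
        if z.sum() == 0 or z.sum() == n:
            continue  # skip empty clusters
        total = 0.0
        for l in (0, 1):
            cluster = X[z == (1 - l)]  # z = 1 -> cluster 1 (l = 0)
            mu = cluster.mean(axis=0)
            total += ((cluster - mu) ** 2).sum()
        if total < best[0]:
            best = (total, z)
    return best

X = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]])
sse_opt, z_opt = solve_kmeans2_bruteforce(X)
print(sse_opt, z_opt)  # 2.0, grouping points {0, 1} against {2, 3}
```

This exhaustive search is exponential in $n$ and only practical for very small instances; the MIQCP lets the solver prune this search tree instead.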


Run Results

After 5 minutes, the solver’s gap remained large:

| Metric | Value |
| --- | --- |
| Best Objective | 1,757,688 |
| Best Bound | 1,592,261 |
| Gap (%) | 62.4% |
| Nodes Explored | 265 |
| Runtime (s) | 275 |

This indicates that the MIQCP is still hard to solve exactly, and we may need to tighten the formulation for better convergence.


Centroids Found

The solver returned the following centroids:

```python
array([[ 66.32243059, 129.07219318],
       [ 41.92044401,  48.09716577]])
```

Visualization: /assets/img/kmeans_out.png
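Given the returned centroids, points can be assigned by nearest centroid for plotting (a small sketch; the query points are made up):

```python
import numpy as np

# Centroids returned by the solver.
centroids = np.array([[66.32243059, 129.07219318],
                      [41.92044401, 48.09716577]])

def assign(points, centroids):
    """Assign each point to the index of its nearest centroid (squared distance)."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

pts = np.array([[70.0, 120.0], [40.0, 50.0]])
print(assign(pts, centroids))  # [0 1]
```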

Visually, though, the result looks good; that said, fixing the number of clusters at $k = 2$ may have been a mistake for this dataset.