Formulating and implementing k-means clustering as a mixed-integer optimization problem.
Suppose that we have $n$ data points $\{\mathbf{x}_i\}_{i=1}^n \subseteq \mathbb{R}^d$ in Euclidean space. We aim to partition these $n$ points into exactly $k$ clusters $S_1, \dots, S_k$ so as to minimize the sum of squared errors (SSE), defined by:
\[\text{SSE}(S_1, \dots, S_k) := \sum_{l=1}^k \sum_{i \in S_l} \|\mathbf{x}_i - \mu_l\|^2,\]
where $\mu_l := \frac{1}{|S_l|} \sum_{i \in S_l} \mathbf{x}_i$ is the centroid of cluster $l$. The $k$-means clustering problem can be formulated as:
\[\min_{S_1, \dots, S_k} \ \text{SSE}(S_1, \dots, S_k).\]
Note that any optimal solution $\mu_1^*, \dots, \mu_k^*, S_1^*, \dots, S_k^*$ satisfies the centroid condition $\mu_l^* = \frac{1}{|S_l^*|} \sum_{i \in S_l^*} \mathbf{x}_i$; therefore, we do not need to explicitly impose this as a constraint.
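Before turning to the integer program, the SSE and the centroid condition can be checked directly on a toy instance. A minimal sketch assuming NumPy (the helper name `sse` and the sample points are ours, not part of the formulation):

```python
import numpy as np

def sse(points, partition):
    """Sum of squared errors for a given partition.

    points: (n, d) array; partition: list of index lists, one per cluster.
    Each cluster's centroid is its mean, matching the centroid condition.
    """
    total = 0.0
    for idx in partition:
        cluster = points[idx]
        mu = cluster.mean(axis=0)             # centroid mu_l
        total += ((cluster - mu) ** 2).sum()  # sum of ||x_i - mu_l||^2
    return total

# Tiny example: two well-separated pairs in R^2.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
print(sse(X, [[0, 1], [2, 3]]))  # natural split -> 4.0
print(sse(X, [[0, 2], [1, 3]]))  # crossed split -> 100.0
```

The natural split achieves a much smaller SSE, which is exactly the quantity the optimization below minimizes.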
In this problem, we will formulate and implement the $k$-means problem step-by-step for the case where $k = 2$.
We define the following variables to linearize the cluster assignments: a binary variable $z_i \in \{0, 1\}$ for each point, with $z_i = 1$ if point $i$ is assigned to cluster 1 and $z_i = 0$ if it is assigned to cluster 2, together with eraser vectors $\mathbf{w}_i^l \in \mathbb{R}^d$ and error vectors $\mathbf{r}_i^l \in \mathbb{R}^d$ for each point $i$ and cluster $l \in \{1, 2\}$.
Because of the definition of $\mathbf{r}_i^l$, the objective function is $\sum_{l=1}^2 \sum_{i=1}^n \|\mathbf{r}_i^l\|^2$.
The eraser vectors are constrained using a large constant $M > 0$:
\[- M (1 - z_i) \mathbf{1} \le \mathbf{w}_i^1 \le M (1 - z_i) \mathbf{1}, \quad \forall i\]
\[- M z_i \mathbf{1} \le \mathbf{w}_i^2 \le M z_i \mathbf{1}, \quad \forall i\]
Interpretation: if $z_i = 1$, then $\mathbf{w}_i^1$ is forced to $\mathbf{0}$ while $\mathbf{w}_i^2$ may take any value in $[-M, M]^d$; if $z_i = 0$, the roles are reversed.
The error vectors are defined as:
\[\mathbf{r}_i^l = \mathbf{x}_i - \mu_l - \mathbf{w}_i^l, \quad \forall i, l\]
Interpretation:
If $\mathbf{w}_i^l$ can be nonzero, the optimizer will set $\mathbf{w}_i^l = \mathbf{x}_i - \mu_l$ when the point is not assigned to cluster $l$, effectively zeroing out the corresponding error.
This ensures that errors are counted only for points actually assigned to a given cluster.
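This eraser mechanism can be checked numerically. The values below are illustrative (hand-picked point and centroids, not solver output), assuming NumPy:

```python
import numpy as np

# Illustrative values, not solver output: a point assigned to cluster 1
# (z_i = 1), so the Big-M bounds pin w_i^1 to 0 and leave w_i^2 free.
x_i = np.array([3.0, 4.0])
mu = [np.array([1.0, 1.0]), np.array([8.0, 9.0])]  # mu_1, mu_2

w = [np.zeros(2),     # w_i^1 = 0: the cluster-1 error is kept
     x_i - mu[1]]     # optimizer erases the cluster-2 error

# r_i^l = x_i - mu_l - w_i^l
r = [x_i - mu[l] - w[l] for l in range(2)]
costs = [float((r[l] ** 2).sum()) for l in range(2)]
print(costs)  # only the assigned cluster contributes: [13.0, 0.0]
```

As expected, the squared error appears only for the cluster the point is assigned to; the other term is erased to zero.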
We ran the k-means model using Gurobi with a 5-minute time limit.
To set up the problem, we chose the Big-M value based on the spatial distribution of the Ruspini dataset and set $M = 1 \times 10^2$. The Big-M constant is used to enforce the eraser-vector constraints in the MIQCP formulation.
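One defensible way to size $M$ from the data (a sketch with made-up coordinates, not the actual Ruspini points): every feasible $\mathbf{w}_i^l$ equals $\mathbf{x}_i - \mu_l$ for some point and centroid, and centroids lie inside the data's bounding box, so the largest coordinate-wise range of the data is a valid bound.

```python
import numpy as np

# Illustrative coordinates, not the Ruspini dataset. Since any needed
# w_i^l is a difference x_i - mu_l and mu_l stays inside the bounding
# box, the largest coordinate-wise range is a safe choice for M.
X = np.array([[4.0, 53.0], [66.0, 129.0], [41.0, 48.0], [100.0, 20.0]])
M = float((X.max(axis=0) - X.min(axis=0)).max())
print(M)  # -> 109.0
```

Keeping $M$ as tight as possible matters: smaller Big-M values generally give stronger LP relaxations and faster branch-and-bound convergence.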
After 5 minutes, the solver’s gap remained large:
| Metric | Value |
|---|---|
| Best Objective | 1,757,688 |
| Best Bound | 1,592,261 |
| Gap (%) | 62.4% |
| Nodes Explored | 265 |
| Runtime (s) | 275 |
This indicates that the MIQCP is still hard to solve exactly, and we may need to tighten the formulation for better convergence.
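As an independent sanity check on small instances (a sketch of ours, not part of the MIQCP model), the exact $k = 2$ optimum can be found by enumerating all $2^n$ assignments, which is feasible only for toy-sized $n$:

```python
import itertools
import numpy as np

def exact_two_means(points):
    """Exact k = 2 clustering by enumerating all 2^n assignments.

    Only practical for small n; useful to verify the MIQCP optimum
    on toy instances.
    """
    n = len(points)
    best = (np.inf, None)
    for mask in itertools.product([0, 1], repeat=n):
        idx0 = [i for i in range(n) if mask[i] == 0]
        idx1 = [i for i in range(n) if mask[i] == 1]
        if not idx0 or not idx1:
            continue  # both clusters must be nonempty
        cost = sum(((points[idx] - points[idx].mean(axis=0)) ** 2).sum()
                   for idx in (idx0, idx1))
        if cost < best[0]:
            best = (cost, mask)
    return best

# Two tight pairs on a line: the optimal split separates the pairs.
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
cost, assignment = exact_two_means(X)
print(cost, assignment)  # -> 1.0 (0, 0, 1, 1)
```

Comparing such brute-force optima against the solver's incumbent on small subsets is a cheap way to confirm the formulation is correct before tightening it.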
The solver returned the following centroids:
```python
array([[ 66.32243059, 129.07219318],
       [ 41.92044401,  48.09716577]])
```
![k-means clustering result](/assets/img/kmeans_out.png)
Actually, the result looks good visually; maybe setting the cluster number to 2 was a mistake.