\begin{comment}

 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.

\end{comment}

\subsection{K-Means Clustering}

\noindent{\bf Description}
\smallskip

Given a collection of $n$ records with a pairwise similarity measure,
the goal of clustering is to assign a category label to each record so that
similar records tend to get the same label.  In contrast to multinomial
logistic regression, clustering is an \emph{unsupervised}\/ learning problem
with neither category assignments nor label interpretations given in advance.
In $k$-means clustering, the records $x_1, x_2, \ldots, x_n$ are numerical
feature vectors of $\dim x_i = m$ with the squared Euclidean distance
$\|x_i - x_{i'}\|_2^2$ as the similarity measure.  We want to partition
$\{x_1, \ldots, x_n\}$ into $k$ clusters $\{S_1, \ldots, S_k\}$ so that
the aggregated squared distance from records to their cluster means is
minimized:
\begin{equation}
\textrm{WCSS}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mean(S_j: x_i\in S_j)\big\|_2^2 \,\,\to\,\,\min
\label{eqn:WCSS}
\end{equation}

The aggregated distance measure in~(\ref{eqn:WCSS}) is called the
\emph{within-cluster sum of squares}~(WCSS).  It can be viewed as a measure
of residual variance that remains in the data after the clustering assignment,
conceptually similar to the residual sum of squares~(RSS) in linear regression.
However, unlike for the RSS, the minimization of~(\ref{eqn:WCSS}) is an NP-hard
problem~\cite{AloiseDHP2009:kmeans}.
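To make the objective in~(\ref{eqn:WCSS}) concrete, here is a minimal
NumPy sketch (illustrative only, not the SystemML implementation) that
evaluates the WCSS for a given assignment of records to clusters:
\begin{verbatim}
import numpy as np

def wcss(X, labels):
    # Sum of squared distances from each record to its cluster's mean;
    # X is n-by-m, labels[i] is the cluster id of record i.
    total = 0.0
    for j in np.unique(labels):
        S_j = X[labels == j]                # records assigned to cluster j
        total += ((S_j - S_j.mean(axis=0)) ** 2).sum()
    return total
\end{verbatim}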

Rather than searching for the global optimum in~(\ref{eqn:WCSS}), a heuristic algorithm
called Lloyd's algorithm is typically used.  This iterative algorithm maintains
and updates a set of $k$~\emph{centroids} $\{c_1, \ldots, c_k\}$, one centroid per cluster.
It defines each cluster $S_j$ as the set of all records closer to~$c_j$ than
to any other centroid.  Each iteration of the algorithm reduces the WCSS in two steps:

\begin{Enumerate}
\item Assign each record to its closest centroid, which generally makes $\mean(S_j)\neq c_j$;
\label{step:kmeans:recluster}
\item Reset each centroid to its cluster's mean: $c_j := \mean(S_j)$.
\label{step:kmeans:recenter}
\end{Enumerate}

After Step~\ref{step:kmeans:recluster} the centroids are generally different from the cluster
means, so we can compute another ``within-cluster sum of squares'' based on the centroids:
\begin{equation}
\textrm{WCSS\_C}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mathop{\textrm{centroid}}(S_j: x_i\in S_j)\big\|_2^2
\label{eqn:WCSS:C}
\end{equation}
This WCSS\_C after Step~\ref{step:kmeans:recluster} is less than (or, once convergence is
achieved, equal to) the means-based WCSS before Step~\ref{step:kmeans:recluster}, and in
Step~\ref{step:kmeans:recenter} the WCSS cannot exceed the WCSS\_C for \emph{the same}
clustering; hence each iteration reduces the WCSS.
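A minimal NumPy sketch of one such iteration (names are illustrative; ties and
empty clusters are ignored here, whereas the actual script shares tied records
equally and treats empty clusters as failures, as described below):
\begin{verbatim}
import numpy as np

def lloyd_iteration(X, C):
    # Step 1: assign each record to its closest centroid.
    dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: reset each centroid to its cluster's mean
    # (assumes no cluster has become empty).
    C_new = np.vstack([X[labels == j].mean(axis=0)
                       for j in range(C.shape[0])])
    return labels, C_new
\end{verbatim}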

Exact convergence is reached when each record becomes closer to its
cluster's mean than to any other cluster's mean, so there are no more re-assignments
and the centroids coincide with the means.  In practice, iterations may be stopped
when the reduction in WCSS (or in WCSS\_C) falls below a minimum threshold, or upon
reaching the maximum number of iterations.  The initialization of the centroids is also
an important part of the algorithm.  The smallest WCSS obtained by the algorithm is
generally not the global minimum and varies depending on the initial centroids.  We
implement multiple parallel runs with different initial centroids and report the best result.

\Paragraph{Scoring}

Our scoring script evaluates the clustering output by comparing it with a known category
assignment.  Since cluster labels have no prior correspondence to the categories, we
cannot count ``correct'' and ``wrong'' cluster assignments.  Instead, we quantify the
agreement in two ways:

\begin{Enumerate}
\item Count how many same-category and different-category pairs of records end up in the
same cluster or in different clusters;
\item For each category, count the prevalence of its most common cluster; for each
cluster, count the prevalence of its most common category.
\end{Enumerate}

The number of categories and the number of clusters ($k$) do not have to be equal.
A same-category pair of records clustered into the same cluster is viewed as a
``true positive,'' a different-category pair clustered together is a ``false positive,''
a same-category pair clustered apart is a ``false negative,'' etc.
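These pair counts can be obtained from the category-by-cluster contingency table
without enumerating all $n(n-1)/2$ pairs.  A NumPy sketch of this computation
(illustrative only, not the scoring script itself):
\begin{verbatim}
import numpy as np

def pair_counts(spY, prY):
    # Contingency table T: rows = specified categories, cols = clusters.
    _, ci = np.unique(spY, return_inverse=True)
    _, ki = np.unique(prY, return_inverse=True)
    T = np.zeros((ci.max() + 1, ki.max() + 1))
    np.add.at(T, (ci, ki), 1)

    pairs = lambda x: x * (x - 1) / 2.0   # unordered pairs among x items

    true_same  = pairs(T).sum()           # same category, same cluster
    false_same = pairs(T.sum(axis=0)).sum() - true_same  # diff cat, same clu
    false_diff = pairs(T.sum(axis=1)).sum() - true_same  # same cat, diff clu
    true_diff  = (pairs(float(len(spY)))
                  - true_same - false_same - false_diff)
    return true_same, true_diff, false_same, false_diff
\end{verbatim}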

\smallskip

\noindent{\bf Usage: K-means Script}
\smallskip

{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}Kmeans.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} C=}path/file
{\tt{} k=}int
{\tt{} runs=}int
{\tt{} maxi=}int
{\tt{} tol=}double
{\tt{} samp=}int
{\tt{} isY=}int
{\tt{} Y=}path/file
{\tt{} fmt=}format
{\tt{} verb=}int
}

\smallskip

\noindent{\bf Usage: K-means Scoring/Prediction}
\smallskip

{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}Kmeans-predict.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} C=}path/file
{\tt{} spY=}path/file
{\tt{} prY=}path/file
{\tt{} fmt=}format
{\tt{} O=}path/file
}

\smallskip

\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location to read matrix $X$ with the input data records as rows
\item[{\tt C}:] (default:\mbox{ }{\tt "C.mtx"})
Location to store the output matrix with the best available cluster centroids as rows
\item[{\tt k}:]
Number of clusters (and centroids)
\item[{\tt runs}:] (default:\mbox{ }{\tt 10})
Number of parallel runs, each run with different initial centroids
\item[{\tt maxi}:] (default:\mbox{ }{\tt 1000})
Maximum number of iterations per run
\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
Tolerance (epsilon) for the single-iteration WCSS\_C change ratio
\item[{\tt samp}:] (default:\mbox{ }{\tt 50})
Average number of records per centroid in data samples used in the centroid
initialization procedure
\item[{\tt Y}:] (default:\mbox{ }{\tt "Y.mtx"})
Location to store the one-column matrix $Y$ with the best available mapping of
records to clusters (defined by the output centroids)
\item[{\tt isY}:] (default:\mbox{ }{\tt 0})
{\tt 0} = do not write matrix~$Y$, {\tt 1} = write~$Y$
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in the SystemML Language Reference for details
\item[{\tt verb}:] (default:\mbox{ }{\tt 0})
{\tt 0} = do not print per-iteration statistics for each run, {\tt 1} = print them
(the ``verbose'' option)
\end{Description}

\smallskip

\noindent{\bf Arguments --- Scoring/Prediction}
\begin{Description}
\item[{\tt X}:] (default:\mbox{ }{\tt " "})
Location to read matrix $X$ with the input data records as rows,
optional when the {\tt prY} input is provided
\item[{\tt C}:] (default:\mbox{ }{\tt " "})
Location to read matrix $C$ with cluster centroids as rows, optional
when the {\tt prY} input is provided; NOTE: if both {\tt X} and {\tt C} are
provided, {\tt prY} is an output, not an input
\item[{\tt spY}:] (default:\mbox{ }{\tt " "})
Location to read a one-column matrix with the externally specified ``true''
assignment of records (rows) to categories, optional for prediction without
scoring
\item[{\tt prY}:] (default:\mbox{ }{\tt " "})
Location to read (or write, if {\tt X} and {\tt C} are present) a
column vector with the predicted assignment of rows to clusters;
NOTE: no prior correspondence is assumed between the predicted
cluster labels and the externally specified categories
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format for {\tt prY}, such as {\tt text}, {\tt mm},
or {\tt csv}; see read/write functions in the SystemML Language Reference
for details
\item[{\tt O}:] (default:\mbox{ }{\tt " "})
Location to write the output statistics defined in
Table~\ref{table:kmeans:predict:stats}; by default they are printed to
standard output
\end{Description}

\begin{table}[t]\small\centerline{%
\begin{tabular}{|lcl|}
\hline
Name & CID & Meaning \\
\hline
{\tt TSS}             &     & Total Sum of Squares (from the total mean) \\
{\tt WCSS\_M}         &     & Within-Cluster Sum of Squares (means as centers) \\
{\tt WCSS\_M\_PC}     &     & Within-Cluster Sum of Squares (means), in \% of TSS \\
{\tt BCSS\_M}         &     & Between-Cluster Sum of Squares (means as centers) \\
{\tt BCSS\_M\_PC}     &     & Between-Cluster Sum of Squares (means), in \% of TSS \\
\hline
{\tt WCSS\_C}         &     & Within-Cluster Sum of Squares (centroids as centers) \\
{\tt WCSS\_C\_PC}     &     & Within-Cluster Sum of Squares (centroids), in \% of TSS \\
{\tt BCSS\_C}         &     & Between-Cluster Sum of Squares (centroids as centers) \\
{\tt BCSS\_C\_PC}     &     & Between-Cluster Sum of Squares (centroids), in \% of TSS \\
\hline
{\tt TRUE\_SAME\_CT}  &     & Same-category pairs predicted as Same-cluster, count \\
{\tt TRUE\_SAME\_PC}  &     & Same-category pairs predicted as Same-cluster, \% \\
{\tt TRUE\_DIFF\_CT}  &     & Diff-category pairs predicted as Diff-cluster, count \\
{\tt TRUE\_DIFF\_PC}  &     & Diff-category pairs predicted as Diff-cluster, \% \\
{\tt FALSE\_SAME\_CT} &     & Diff-category pairs predicted as Same-cluster, count \\
{\tt FALSE\_SAME\_PC} &     & Diff-category pairs predicted as Same-cluster, \% \\
{\tt FALSE\_DIFF\_CT} &     & Same-category pairs predicted as Diff-cluster, count \\
{\tt FALSE\_DIFF\_PC} &     & Same-category pairs predicted as Diff-cluster, \% \\
\hline
{\tt SPEC\_TO\_PRED}  & $+$ & For specified category, the best predicted cluster id \\
{\tt SPEC\_FULL\_CT}  & $+$ & For specified category, its full count \\
{\tt SPEC\_MATCH\_CT} & $+$ & For specified category, best-cluster matching count \\
{\tt SPEC\_MATCH\_PC} & $+$ & For specified category, \% of matching to full count \\
{\tt PRED\_TO\_SPEC}  & $+$ & For predicted cluster, the best specified category id \\
{\tt PRED\_FULL\_CT}  & $+$ & For predicted cluster, its full count \\
{\tt PRED\_MATCH\_CT} & $+$ & For predicted cluster, best-category matching count \\
{\tt PRED\_MATCH\_PC} & $+$ & For predicted cluster, \% of matching to full count \\
\hline
\end{tabular}}
\caption{The {\tt O}-file for {\tt Kmeans-predict} provides the output statistics
in CSV format, one per line, in the format (NAME, [CID], VALUE).  Note:
the 1st-group statistics are given if the {\tt X} input is available;
the 2nd-group statistics are given if the {\tt X} and {\tt C} inputs are available;
the 3rd- and 4th-group statistics are given if the {\tt spY} input is available;
only the 4th-group statistics contain a nonempty CID value;
when present, CID contains either the specified category label or the
predicted cluster label.}
\label{table:kmeans:predict:stats}
\end{table}

\noindent{\bf Details}
\smallskip

Our clustering script proceeds in three stages: centroid initialization,
parallel $k$-means iterations, and best-available output generation.
Centroids are initialized at random from the input records (the rows of~$X$),
biased towards being chosen far apart from each other.  The initialization
method is based on the {\tt k-means++} heuristic from~\cite{ArthurVassilvitskii2007:kmeans},
with one important difference: to reduce the number of passes through~$X$,
we take a small sample of $X$ and run the {\tt k-means++} heuristic over
this sample.  Conceptually, our centroid initialization algorithm for one
clustering run proceeds as follows:

\begin{Enumerate}
\item Sample the rows of~$X$ uniformly at random, picking each row with probability
$p = ks / n$, where
\begin{Itemize}
\item $k$~is the number of centroids,
\item $n$~is the number of records, and
\item $s$~is the {\tt samp} input parameter.
\end{Itemize}
If $ks \geq n$, the entire $X$ is used in place of its sample.
\item Choose the first centroid uniformly at random from the sampled rows.
\item Choose each subsequent centroid from the sampled rows, at random, with
probability proportional to the squared Euclidean distance between the row and
the nearest already-chosen centroid.
\end{Enumerate}
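A conceptual NumPy sketch of this initialization for one run (names are
illustrative; a per-row coin toss stands in for the skip-ahead sampling
described next):
\begin{verbatim}
import numpy as np

def init_centroids(X, k, s, rng):
    # Step 1: uniform row sample with probability p = ks/n
    # (assumes the sample comes out non-empty).
    n = X.shape[0]
    p = k * s / n
    sample = X if p >= 1.0 else X[rng.random(n) < p]
    # Step 2: first centroid, uniform over the sample.
    C = [sample[rng.integers(len(sample))]]
    d2 = ((sample - C[0]) ** 2).sum(axis=1)
    # Step 3: subsequent centroids, weighted by squared distance
    # to the nearest centroid chosen so far.
    for _ in range(k - 1):
        i = rng.choice(len(sample), p=d2 / d2.sum())
        C.append(sample[i])
        d2 = np.minimum(d2, ((sample - C[-1]) ** 2).sum(axis=1))
    return np.vstack(C)
\end{verbatim}
Here {\tt rng} would be, for example, {\tt np.random.default\_rng()}.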

The sampling of $X$ and the selection of centroids are performed independently
and in parallel for each run of the $k$-means algorithm.  When we sample the
rows of~$X$, rather than tossing a random coin for each row, we compute the
number of rows to skip until the next sampled row as $\lceil \log(u) / \log(1 - p) \rceil$,
where $u\in (0, 1)$ is uniformly random.  This time-saving trick works because
\begin{equation*}
\Prob [k-1 < \log_{1-p}(u) < k] \,\,=\,\, p(1-p)^{k-1} \,\,=\,\,
\Prob [\textrm{skip $k-1$ rows}]
\end{equation*}
However, it requires us to estimate the maximum sample size in advance, which we set
near~$ks + 10\sqrt{ks}$ to make it generous enough.
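A sketch of the skip-ahead sampling itself (illustrative names; assumes $0 < p < 1$):
\begin{verbatim}
import numpy as np

def sample_row_ids(n, p, rng):
    # Draw the gap to the next sampled row from a geometric
    # distribution instead of tossing one coin per row.
    ids, i = [], -1
    while True:
        u = 1.0 - rng.random()   # u in (0, 1]
        # max(..., 1) guards the measure-zero case u == 1.
        i += max(1, int(np.ceil(np.log(u) / np.log(1.0 - p))))
        if i >= n:
            return ids
        ids.append(i)
\end{verbatim}
The expected sample size is $np = ks$; per the estimate above, a buffer of
about $ks + 10\sqrt{ks}$ rows makes overflow unlikely.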

Once we have selected the initial centroid sets, we start the $k$-means iterations
independently in parallel for all clustering runs.  The number of clustering runs
is given by the {\tt runs} input parameter.  Each iteration of each clustering run
performs the following steps (see the sketch after the list):

\begin{Itemize}
\item Compute the centroid-dependent part of the squared Euclidean distances from
all records (rows of~$X$) to each of the $k$~centroids using a matrix product;
\item Take the minimum of the above for each record;
\item Update the current within-cluster sum of squares (WCSS) value, with centroids
substituted for the means, for efficiency;
\item Check the convergence criterion:\hfil
$\textrm{WCSS}_{\mathrm{old}} - \textrm{WCSS}_{\mathrm{new}} < \eps\cdot\textrm{WCSS}_{\mathrm{new}}$\linebreak
as well as the iteration-count limit;
\item Find the closest centroid for each record, sharing equally any records with multiple
closest centroids;
\item Compute the number of records closest to each centroid, checking for ``runaway''
centroids with no records left (in which case the run fails);
\item Compute the new centroids by averaging the records in their clusters.
\end{Itemize}
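A dense NumPy sketch of one such iteration (illustrative names; unlike the
script, ties here go to the lowest-index centroid instead of being shared
equally):
\begin{verbatim}
import numpy as np

def kmeans_iteration(X, C, wcss_old, eps):
    # Centroid-dependent part of ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2;
    # the ||x||^2 term does not depend on c, so it can be omitted when
    # searching for the closest centroid.
    D = (C ** 2).sum(axis=1) - 2.0 * (X @ C.T)     # n-by-k
    labels = D.argmin(axis=1)
    # Centroid-based objective (WCSS_C) for the convergence check.
    wcss_new = (X ** 2).sum() + D[np.arange(len(X)), labels].sum()
    converged = wcss_old - wcss_new < eps * wcss_new
    # Cluster sizes; an empty cluster is a "runaway" centroid.
    counts = np.bincount(labels, minlength=C.shape[0])
    if (counts == 0).any():
        raise RuntimeError("run failed: a centroid has no records left")
    # New centroids: per-cluster averages of the records.
    sums = np.zeros_like(C, dtype=float)
    np.add.at(sums, labels, X)
    return sums / counts[:, None], wcss_new, converged
\end{verbatim}
Computing only the centroid-dependent part keeps the per-iteration work to a
single $n\,{\times}\,m$ by $m\,{\times}\,k$ matrix product plus cheap
aggregations, which is what makes the step above efficient at scale.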

When a termination condition is satisfied, we store the centroids and the WCSS value
and exit this run.  A run has to satisfy the WCSS convergence criterion to be considered
successful.  Upon the termination of all runs, we select the smallest WCSS value among
the successful runs, and write out this run's centroids.  If requested, we also compute
the cluster assignment of all records in~$X$, using integers from 1 to~$k$ as the cluster
labels.  The scoring script can then be used to compare the cluster assignment with
an externally specified category assignment.

\smallskip

\noindent{\bf Returns}
\smallskip

We output the $k$ centroids of the best available clustering, i.e.\ the one whose WCSS
is the smallest among all successful runs.
The centroids are written as the rows of a $k\,{\times}\,m$ matrix into the output
file whose path/name was provided as the ``{\tt C}'' input argument.  If the input
parameter ``{\tt isY}'' was set to~{\tt 1}, we also output the one-column matrix with
the cluster assignment of all the records.  This assignment is written into the
file whose path/name was provided as the ``{\tt Y}'' input argument.
The best WCSS value, as well as some information about the performance of the other
runs, is printed during the script execution.  The scoring script {\tt Kmeans-predict}
prints all its results in a self-explanatory manner, as defined in
Table~\ref{table:kmeans:predict:stats}.

\smallskip

\noindent{\bf Examples}
\smallskip

{\hangindent=\parindent\noindent\tt
\hml -f Kmeans.dml -nvargs X=/user/biadmin/X.mtx k=5 C=/user/biadmin/centroids.mtx fmt=csv
}

{\hangindent=\parindent\noindent\tt
\hml -f Kmeans.dml -nvargs X=/user/biadmin/X.mtx k=5 runs=100 maxi=5000
tol=0.00000001 samp=20 C=/user/biadmin/centroids.mtx isY=1 Y=/user/biadmin/Yout.mtx verb=1
}

\noindent To predict {\tt Y} given {\tt X} and {\tt C}:

{\hangindent=\parindent\noindent\tt
\hml -f Kmeans-predict.dml -nvargs X=/user/biadmin/X.mtx
C=/user/biadmin/C.mtx prY=/user/biadmin/PredY.mtx O=/user/biadmin/stats.csv
}

\noindent To compare ``actual'' labels {\tt spY} with ``predicted'' labels given {\tt X} and {\tt C}:

{\hangindent=\parindent\noindent\tt
\hml -f Kmeans-predict.dml -nvargs X=/user/biadmin/X.mtx
C=/user/biadmin/C.mtx spY=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
}

\noindent To compare ``actual'' labels {\tt spY} with given ``predicted'' labels {\tt prY}:

{\hangindent=\parindent\noindent\tt
\hml -f Kmeans-predict.dml -nvargs spY=/user/biadmin/Y.mtx prY=/user/biadmin/PredY.mtx O=/user/biadmin/stats.csv
}

\smallskip

\noindent{\bf References}
\begin{itemize}
\item
D.~Aloise, A.~Deshpande, P.~Hansen, and P.~Popat.
\newblock {NP}-hardness of {E}uclidean sum-of-squares clustering.
\newblock {\em Machine Learning}, 75(2):245--248, May 2009.
\item
D.~Arthur and S.~Vassilvitskii.
\newblock {\tt k-means++}: The advantages of careful seeding.
\newblock In {\em Proceedings of the 18th Annual {ACM-SIAM} Symposium on
Discrete Algorithms ({SODA}~2007)}, pages 1027--1035, New Orleans~{LA},
{USA}, January 7--9, 2007.
\end{itemize}