| <!DOCTYPE html> |
| <!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--> |
| <!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]--> |
| <!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]--> |
| <!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]--> |
| <head> |
| <title>SystemML Algorithms Reference - Clustering - SystemML 1.1.0</title> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| |
| <meta name="viewport" content="width=device-width"> |
| <link rel="stylesheet" href="css/bootstrap.min.css"> |
| <link rel="stylesheet" href="css/main.css"> |
| <link rel="stylesheet" href="css/pygments-default.css"> |
| <link rel="shortcut icon" href="img/favicon.png"> |
| </head> |
| <body> |
| <!--[if lt IE 7]> |
| <p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p> |
| <![endif]--> |
| |
| <header class="navbar navbar-default navbar-fixed-top" id="topbar"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <div class="navbar-brand brand projectlogo"> |
| <a href="http://systemml.apache.org/"><img class="logo" src="img/systemml-logo.png" alt="Apache SystemML" title="Apache SystemML"/></a> |
| </div> |
| <div class="navbar-brand brand projecttitle"> |
| <a href="http://systemml.apache.org/">Apache SystemML<sup id="trademark">™</sup></a><br/> |
| <span class="version">1.1.0</span> |
| </div> |
| <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| </div> |
| <nav class="navbar-collapse collapse"> |
| <ul class="nav navbar-nav navbar-right"> |
| <li><a href="index.html">Overview</a></li> |
| <li><a href="https://github.com/apache/systemml">GitHub</a></li> |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">Documentation<b class="caret"></b></a> |
| <ul class="dropdown-menu" role="menu"> |
| <li><b>Running SystemML:</b></li> |
| <li><a href="https://github.com/apache/systemml">SystemML GitHub README</a></li> |
| <li><a href="spark-mlcontext-programming-guide.html">Spark MLContext</a></li> |
<li><a href="spark-batch-mode.html">Spark Batch Mode</a></li>
<li><a href="hadoop-batch-mode.html">Hadoop Batch Mode</a></li>
<li><a href="standalone-guide.html">Standalone Guide</a></li>
<li><a href="jmlc.html">Java Machine Learning Connector (JMLC)</a></li>
| <li class="divider"></li> |
| <li><b>Language Guides:</b></li> |
| <li><a href="dml-language-reference.html">DML Language Reference</a></li> |
| <li><a href="beginners-guide-to-dml-and-pydml.html">Beginner's Guide to DML and PyDML</a></li> |
| <li><a href="beginners-guide-python.html">Beginner's Guide for Python Users</a></li> |
| <li><a href="python-reference.html">Reference Guide for Python Users</a></li> |
| <li class="divider"></li> |
| <li><b>ML Algorithms:</b></li> |
| <li><a href="algorithms-reference.html">Algorithms Reference</a></li> |
| <li class="divider"></li> |
| <li><b>Tools:</b></li> |
| <li><a href="debugger-guide.html">Debugger Guide</a></li> |
| <li><a href="developer-tools-systemml.html">IDE Guide</a></li> |
| <li class="divider"></li> |
| <li><b>Other:</b></li> |
| <li><a href="contributing-to-systemml.html">Contributing to SystemML</a></li> |
| <li><a href="engine-dev-guide.html">Engine Developer Guide</a></li> |
| <li><a href="troubleshooting-guide.html">Troubleshooting Guide</a></li> |
| <li><a href="release-process.html">Release Process</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a> |
| <ul class="dropdown-menu" role="menu"> |
| <li><a href="./api/java/index.html">Java</a></li> |
| <li><a href="./api/python/index.html">Python</a></li> |
| </ul> |
| </li> |
| |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">Issues<b class="caret"></b></a> |
| <ul class="dropdown-menu" role="menu"> |
| <li><b>JIRA:</b></li> |
| <li><a href="https://issues.apache.org/jira/browse/SYSTEMML">SystemML JIRA</a></li> |
| |
| </ul> |
| </li> |
| </ul> |
| </nav> |
| </div> |
| </header> |
| |
| <div class="container" id="content"> |
| |
| <h1 class="title"><a href="algorithms-reference.html">SystemML Algorithms Reference</a></h1> |
| |
| |
| |
| <h1 id="clustering">3. Clustering</h1> |
| |
| <h2 id="k-means-clustering">3.1. K-Means Clustering</h2> |
| |
| <h3 id="description">Description</h3> |
| |
| <p>Given a collection of $n$ records with a pairwise similarity measure, |
| the goal of clustering is to assign a category label to each record so |
| that similar records tend to get the same label. In contrast to |
| multinomial logistic regression, clustering is an <em>unsupervised</em> |
| learning problem with neither category assignments nor label |
| interpretations given in advance. In $k$-means clustering, the records |
$x_1, x_2, \ldots, x_n$ are numerical feature vectors of $\dim x_i = m$
with the squared Euclidean distance $\|x_i - x_{i'}\|_2^2$ as the
| similarity measure. We want to partition $\{x_1, \ldots, x_n\}$ into $k$ |
| clusters $\{S_1, \ldots, S_k\}$ so that the aggregated squared distance |
| from records to their cluster means is minimized:</p> |
| |
| <script type="math/tex; mode=display">\begin{equation} |
| \textrm{WCSS}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - mean(S_j: x_i\in S_j)\big\|_2^2 \,\,\to\,\,\min |
| \end{equation}</script> |
| |
| <p>The aggregated distance measure in (1) is |
| called the <em>within-cluster sum of squares</em> (WCSS). It can be viewed as a |
| measure of residual variance that remains in the data after the |
| clustering assignment, conceptually similar to the residual sum of |
| squares (RSS) in linear regression. However, unlike for the RSS, the |
| minimization of (1) is an NP-hard |
| problem <a href="algorithms-bibliography.html">[AloiseDHP2009]</a>.</p> |
| |
| <p>Rather than searching for the global optimum in (1), a |
| heuristic algorithm called Lloyd’s algorithm is typically used. This |
| iterative algorithm maintains and updates a set of $k$ <em>centroids</em> |
| $\{c_1, \ldots, c_k\}$, one centroid per cluster. It defines each |
| cluster $S_j$ as the set of all records closer to $c_j$ than to any |
| other centroid. Each iteration of the algorithm reduces the WCSS in two |
| steps:</p> |
| |
| <ol> |
<li>Assign each record to its closest centroid, which in general makes
$mean(S_j)\neq c_j$</li>
| <li>Reset each centroid to its cluster’s mean: |
| $c_j := mean(S_j)$</li> |
| </ol> |
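<p>Conceptually, the two steps can be sketched in a few lines of Python (an illustrative NumPy sketch, not the SystemML implementation; the function and variable names are invented for this example):</p>

```python
import numpy as np

def lloyd_kmeans(X, C, max_iter=100):
    """One run of Lloyd's algorithm: X is an n-by-m data matrix,
    C is a k-by-m matrix of initial centroids."""
    for _ in range(max_iter):
        # Step 1: assign each record to its closest centroid.
        dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # n-by-k
        labels = dists.argmin(axis=1)
        # Step 2: reset each centroid to its cluster's mean.
        new_C = np.array([X[labels == j].mean(axis=0) for j in range(C.shape[0])])
        if np.allclose(new_C, C):   # centroids coincide with the means: converged
            break
        C = new_C
    wcss = ((X - C[labels]) ** 2).sum()
    return C, labels, wcss
```

<p>The loop stops once the centroids stop moving (or after <code>max_iter</code> passes); for simplicity this sketch assumes every centroid keeps at least one record.</p>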
| |
| <p>After Step 1, the centroids are generally |
| different from the cluster means, so we can compute another |
| “within-cluster sum of squares” based on the centroids:</p> |
| |
<script type="math/tex; mode=display">\textrm{WCSS\_C}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mathop{\textrm{centroid}}(S_j: x_i\in S_j)\big\|_2^2
\label{eqn:WCSS:C}</script>
| |
<p>This WCSS_C after Step 1
is at most the means-based WCSS before Step 1
(with equality once convergence is achieved), and in Step 2
the WCSS cannot exceed the WCSS_C for <em>the same</em> clustering; hence
each iteration can only reduce the WCSS.</p>
| |
| <p>Exact convergence is reached when each record becomes closer to its |
| cluster’s mean than to any other cluster’s mean, so there are no more |
| re-assignments and the centroids coincide with the means. In practice, |
| iterations may be stopped when the reduction in WCSS (or in WCSS_C) |
| falls below a minimum threshold, or upon reaching the maximum number of |
| iterations. The initialization of the centroids is also an important |
part of the algorithm. The smallest WCSS obtained by the algorithm is
not necessarily the global minimum and varies depending on the initial
centroids. We implement multiple parallel runs with different initial
centroids and report the best result.</p>
| |
| <p><strong>Scoring.</strong> Our scoring script evaluates the clustering output by comparing it with |
| a known category assignment. Since cluster labels have no prior |
| correspondence to the categories, we cannot count “correct” and “wrong” |
| cluster assignments. Instead, we quantify them in two ways:</p> |
| |
| <ol> |
| <li>Count how many same-category and different-category pairs of records end |
| up in the same cluster or in different clusters;</li> |
| <li>For each category, count the prevalence of its most common cluster; for |
| each cluster, count the prevalence of its most common category.</li> |
| </ol> |
| |
| <p>The number of categories and the number of clusters ($k$) do not have to |
| be equal. A same-category pair of records clustered into the same |
| cluster is viewed as a “true positive,” a different-category pair |
clustered together is a “false positive,” a same-category pair clustered
apart is a “false negative,” and so on.</p>
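<p>The pair counting can be illustrated with a small Python sketch (a direct O(n<sup>2</sup>) enumeration for clarity; the names are invented, and the actual scoring script derives these counts from aggregate tables instead):</p>

```python
from itertools import combinations

def pair_counts(categories, clusters):
    """Count record pairs by same/different category vs. same/different cluster."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(categories)), 2):
        same_cat = categories[i] == categories[j]
        same_clu = clusters[i] == clusters[j]
        if same_cat and same_clu:
            tp += 1   # "true positive": same-category pair clustered together
        elif not same_cat and same_clu:
            fp += 1   # "false positive": diff-category pair clustered together
        elif same_cat and not same_clu:
            fn += 1   # "false negative": same-category pair clustered apart
        else:
            tn += 1   # diff-category pair clustered apart
    return tp, fp, fn, tn
```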
| |
| <h3 id="usage">Usage</h3> |
| |
| <p><strong>K-Means</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f Kmeans.dml |
| -nvargs X=<file> |
| C=[file] |
| k=<int> |
| runs=[int] |
| maxi=[int] |
| tol=[double] |
| samp=[int] |
| isY=[boolean] |
| Y=[file] |
| fmt=[format] |
| verb=[boolean] |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f Kmeans.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=<file> |
| C=[file] |
| k=<int> |
| runs=[int] |
| maxi=[int] |
| tol=[double] |
| samp=[int] |
| isY=[boolean] |
| Y=[file] |
| fmt=[format] |
| verb=[boolean] |
| </code></pre> |
| </div> |
| </div> |
| |
| <p><strong>K-Means Prediction</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f Kmeans-predict.dml |
| -nvargs X=[file] |
| C=[file] |
| spY=[file] |
| prY=[file] |
| fmt=[format] |
| O=[file] |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f Kmeans-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=[file] |
| C=[file] |
| spY=[file] |
| prY=[file] |
| fmt=[format] |
| O=[file] |
| </code></pre> |
| </div> |
| </div> |
| |
| <h3 id="arguments---k-means">Arguments - K-Means</h3> |
| |
| <p><strong>X</strong>: Location to read matrix $X$ with the input data records as rows</p> |
| |
| <p><strong>C</strong>: (default: <code>"C.mtx"</code>) Location to store the output matrix with the best available |
| cluster centroids as rows</p> |
| |
| <p><strong>k</strong>: Number of clusters (and centroids)</p> |
| |
| <p><strong>runs</strong>: (default: <code>10</code>) Number of parallel runs, each run with different initial |
| centroids</p> |
| |
| <p><strong>maxi</strong>: (default: <code>1000</code>) Maximum number of iterations per run</p> |
| |
| <p><strong>tol</strong>: (default: <code>0.000001</code>) Tolerance (epsilon) for single-iteration WCSS_C change ratio</p> |
| |
| <p><strong>samp</strong>: (default: <code>50</code>) Average number of records per centroid in data samples used |
| in the centroid initialization procedure</p> |
| |
| <p><strong>Y</strong>: (default: <code>"Y.mtx"</code>) Location to store the one-column matrix $Y$ with the best |
| available mapping of records to clusters (defined by the output |
| centroids)</p> |
| |
<p><strong>isY</strong>: (default: <code>FALSE</code>) Whether to write out matrix $Y$; with the
default <code>FALSE</code>, $Y$ is not written</p>
| |
| <p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format, such as <code>text</code>, |
| <code>mm</code>, or <code>csv</code>; see read/write functions in |
| SystemML Language Reference for details.</p> |
| |
<p><strong>verb</strong>: (default: <code>FALSE</code>) Whether to print per-iteration statistics
for each run; with the default <code>FALSE</code>, they are not printed</p>
| |
| <h3 id="arguments---k-means-prediction">Arguments - K-Means Prediction</h3> |
| |
| <p><strong>X</strong>: (default: <code>" "</code>) Location to read matrix $X$ with the input data records as |
| rows, optional when <code>prY</code> input is provided</p> |
| |
| <p><strong>C</strong>: (default: <code>" "</code>) Location to read matrix $C$ with cluster centroids as rows, |
| optional when <code>prY</code> input is provided; NOTE: if both |
| X and C are provided, <code>prY</code> is an |
| output, not input</p> |
| |
| <p><strong>spY</strong>: (default: <code>" "</code>) Location to read a one-column matrix with the externally |
| specified “true” assignment of records (rows) to categories, optional |
| for prediction without scoring</p> |
| |
| <p><strong>prY</strong>: (default: <code>" "</code>) Location to read (or write, if X and |
| C are present) a column-vector with the predicted |
| assignment of rows to clusters; NOTE: No prior correspondence is assumed |
| between the predicted cluster labels and the externally specified |
| categories</p> |
| |
| <p><strong>fmt</strong>: (default: <code>"text"</code>) Matrix file output format for <code>prY</code>, such as |
| <code>text</code>, <code>mm</code>, or <code>csv</code>; see read/write |
| functions in SystemML Language Reference for details.</p> |
| |
<p><strong>O</strong>: (default: <code>" "</code>) Location to write the output statistics defined in
<a href="algorithms-clustering.html#table6"><strong>Table 6</strong></a>; by default they are printed to
standard output</p>
| |
| <h3 id="examples">Examples</h3> |
| |
| <p><strong>K-Means</strong>:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f Kmeans.dml |
| -nvargs X=/user/ml/X.mtx |
| k=5 |
| C=/user/ml/centroids.mtx |
| fmt=csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f Kmeans.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| k=5 |
| C=/user/ml/centroids.mtx |
| fmt=csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f Kmeans.dml |
| -nvargs X=/user/ml/X.mtx |
| k=5 |
| runs=100 |
| maxi=5000 |
| tol=0.00000001 |
| samp=20 |
| C=/user/ml/centroids.mtx |
| isY=1 |
| Y=/user/ml/Yout.mtx |
| verb=1 |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f Kmeans.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| k=5 |
| runs=100 |
| maxi=5000 |
| tol=0.00000001 |
| samp=20 |
| C=/user/ml/centroids.mtx |
| isY=1 |
| Y=/user/ml/Yout.mtx |
| verb=1 |
| </code></pre> |
| </div> |
| </div> |
| |
| <p><strong>K-Means Prediction</strong>:</p> |
| |
| <p>To predict Y given X and C:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f Kmeans-predict.dml |
| -nvargs X=/user/ml/X.mtx |
| C=/user/ml/C.mtx |
| prY=/user/ml/PredY.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f Kmeans-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| C=/user/ml/C.mtx |
| prY=/user/ml/PredY.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <p>To compare “actual” labels <code>spY</code> with “predicted” labels |
| given X and C:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f Kmeans-predict.dml |
| -nvargs X=/user/ml/X.mtx |
| C=/user/ml/C.mtx |
| spY=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f Kmeans-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs X=/user/ml/X.mtx |
| C=/user/ml/C.mtx |
| spY=/user/ml/Y.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <p>To compare “actual” labels <code>spY</code> with given “predicted” |
| labels prY:</p> |
| |
| <div class="codetabs"> |
| <div data-lang="Hadoop"> |
| <pre><code>hadoop jar SystemML.jar -f Kmeans-predict.dml |
| -nvargs spY=/user/ml/Y.mtx |
| prY=/user/ml/PredY.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| <div data-lang="Spark"> |
| <pre><code>$SPARK_HOME/bin/spark-submit --master yarn |
| --deploy-mode cluster |
| --conf spark.driver.maxResultSize=0 |
| SystemML.jar |
| -f Kmeans-predict.dml |
| -config SystemML-config.xml |
| -exec hybrid_spark |
| -nvargs spY=/user/ml/Y.mtx |
| prY=/user/ml/PredY.mtx |
| O=/user/ml/stats.csv |
| </code></pre> |
| </div> |
| </div> |
| |
| <hr /> |
| |
| <p><a name="table6"></a> |
| <strong>Table 6</strong>: The O-file for Kmeans-predict provides the |
| output statistics in CSV format, one per line, in the following |
| format: (NAME, [CID], VALUE). Note: the 1st group statistics are |
| given if X input is available; the 2nd group statistics |
| are given if X and C inputs are available; |
| the 3rd and 4th group statistics are given if spY input |
| is available; only the 4th group statistics contain a nonempty CID |
| value; when present, CID contains either the specified category label |
| or the predicted cluster label.</p> |
| |
| <table> |
| <thead> |
| <tr> |
| <th>Inputs Available</th> |
| <th>Name</th> |
| <th>CID</th> |
| <th>Meaning</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td style="text-align: center" rowspan="5">X</td> |
| <td>TSS</td> |
| <td> </td> |
| <td>Total Sum of Squares (from the total mean)</td> |
| </tr> |
| <tr> |
| <td>WCSS_M</td> |
| <td> </td> |
| <td>Within-Cluster Sum of Squares (means as centers)</td> |
| </tr> |
| <tr> |
| <td>WCSS_M_PC</td> |
| <td> </td> |
| <td>Within-Cluster Sum of Squares (means), in % of TSS</td> |
| </tr> |
| <tr> |
| <td>BCSS_M</td> |
| <td> </td> |
| <td>Between-Cluster Sum of Squares (means as centers)</td> |
| </tr> |
| <tr> |
| <td>BCSS_M_PC</td> |
| <td> </td> |
| <td>Between-Cluster Sum of Squares (means), in % of TSS</td> |
| </tr> |
| <tr> |
| <td style="text-align: center" rowspan="4">X and C</td> |
| <td>WCSS_C</td> |
| <td> </td> |
| <td>Within-Cluster Sum of Squares (centroids as centers)</td> |
| </tr> |
| <tr> |
| <td>WCSS_C_PC</td> |
| <td> </td> |
| <td>Within-Cluster Sum of Squares (centroids), % of TSS</td> |
| </tr> |
| <tr> |
| <td>BCSS_C</td> |
| <td> </td> |
| <td>Between-Cluster Sum of Squares (centroids as centers)</td> |
| </tr> |
| <tr> |
| <td>BCSS_C_PC</td> |
| <td> </td> |
| <td>Between-Cluster Sum of Squares (centroids), % of TSS</td> |
| </tr> |
| <tr> |
| <td style="text-align: center" rowspan="8">spY</td> |
| <td>TRUE_SAME_CT</td> |
| <td> </td> |
| <td>Same-category pairs predicted as Same-cluster, count</td> |
| </tr> |
| <tr> |
| <td>TRUE_SAME_PC</td> |
| <td> </td> |
| <td>Same-category pairs predicted as Same-cluster, %</td> |
| </tr> |
| <tr> |
| <td>TRUE_DIFF_CT</td> |
| <td> </td> |
| <td>Diff-category pairs predicted as Diff-cluster, count</td> |
| </tr> |
| <tr> |
| <td>TRUE_DIFF_PC</td> |
| <td> </td> |
| <td>Diff-category pairs predicted as Diff-cluster, %</td> |
| </tr> |
| <tr> |
| <td>FALSE_SAME_CT</td> |
| <td> </td> |
| <td>Diff-category pairs predicted as Same-cluster, count</td> |
| </tr> |
| <tr> |
| <td>FALSE_SAME_PC</td> |
| <td> </td> |
| <td>Diff-category pairs predicted as Same-cluster, %</td> |
| </tr> |
| <tr> |
| <td>FALSE_DIFF_CT</td> |
| <td> </td> |
| <td>Same-category pairs predicted as Diff-cluster, count</td> |
| </tr> |
| <tr> |
| <td>FALSE_DIFF_PC</td> |
| <td> </td> |
| <td>Same-category pairs predicted as Diff-cluster, %</td> |
| </tr> |
| <tr> |
| <td style="text-align: center" rowspan="8">spY</td> |
| <td>SPEC_TO_PRED</td> |
| <td style="text-align: center">+</td> |
| <td>For specified category, the best predicted cluster id</td> |
| </tr> |
| <tr> |
| <td>SPEC_FULL_CT</td> |
| <td style="text-align: center">+</td> |
| <td>For specified category, its full count</td> |
| </tr> |
| <tr> |
| <td>SPEC_MATCH_CT</td> |
| <td style="text-align: center">+</td> |
| <td>For specified category, best-cluster matching count</td> |
| </tr> |
| <tr> |
| <td>SPEC_MATCH_PC</td> |
| <td style="text-align: center">+</td> |
| <td>For specified category, % of matching to full count</td> |
| </tr> |
| <tr> |
| <td>PRED_TO_SPEC</td> |
| <td style="text-align: center">+</td> |
| <td>For predicted cluster, the best specified category id</td> |
| </tr> |
| <tr> |
| <td>PRED_FULL_CT</td> |
| <td style="text-align: center">+</td> |
| <td>For predicted cluster, its full count</td> |
| </tr> |
| <tr> |
| <td>PRED_MATCH_CT</td> |
| <td style="text-align: center">+</td> |
| <td>For predicted cluster, best-category matching count</td> |
| </tr> |
| <tr> |
| <td>PRED_MATCH_PC</td> |
| <td style="text-align: center">+</td> |
| <td>For predicted cluster, % of matching to full count</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <hr /> |
| |
| <h3 id="details">Details</h3> |
| |
| <p>Our clustering script proceeds in 3 stages: centroid initialization, |
| parallel $k$-means iterations, and the best-available output generation. |
| Centroids are initialized at random from the input records (the rows |
| of $X$), biased towards being chosen far apart from each other. The |
| initialization method is based on the <code>k-means++</code> heuristic |
| from <a href="algorithms-bibliography.html">[ArthurVassilvitskii2007]</a>, with one important difference: to |
| reduce the number of passes through $X$, we take a small sample of $X$ |
| and run the <code>k-means++</code> heuristic over this sample. Here is, |
| conceptually, our centroid initialization algorithm for one clustering |
| run:</p> |
| |
| <ol> |
| <li>Sample the rows of $X$ uniformly at random, picking each row with |
| probability $p = ks / n$ where |
| <ul> |
| <li>$k$ is the number of centroids</li> |
| <li>$n$ is the number of records</li> |
<li>$s$ is the <code>samp</code> input parameter</li>
| </ul> |
| If $ks \geq n$, the entire $X$ is used in place of its sample. |
| </li> |
| <li>Choose the first centroid uniformly at random from the sampled rows.</li> |
| <li>Choose each subsequent centroid from the sampled rows, at random, with |
| probability proportional to the squared Euclidean distance between the |
| row and the nearest already-chosen centroid.</li> |
| </ol> |
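<p>A NumPy sketch of this initialization (hypothetical helper names; for simplicity it samples rows with a per-row coin toss, whereas the actual script uses the skip-based sampling described below):</p>

```python
import numpy as np

def init_centroids(X, k, s, rng):
    """Sampled k-means++ initialization for one clustering run."""
    n = X.shape[0]
    p = k * s / n
    # Step 1: uniform sample of the rows; use all of X if k*s >= n.
    sample = X if p >= 1.0 else X[rng.random(n) < p]
    # Step 2: first centroid uniformly at random from the sample.
    C = [sample[rng.integers(len(sample))]]
    for _ in range(1, k):
        # Step 3: squared distance to the nearest already-chosen centroid ...
        d2 = ((sample[:, None, :] - np.array(C)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # ... and the next centroid drawn with probability proportional to it.
        C.append(sample[rng.choice(len(sample), p=d2 / d2.sum())])
    return np.array(C)
```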
| |
| <p>The sampling of $X$ and the selection of centroids are performed |
| independently and in parallel for each run of the $k$-means algorithm. |
| When we sample the rows of $X$, rather than tossing a random coin for |
| each row, we compute the number of rows to skip until the next sampled |
| row as $\lceil \log(u) / \log(1 - p) \rceil$ where $u\in (0, 1)$ is |
| uniformly random. This time-saving trick works because</p> |
| |
| <script type="math/tex; mode=display">% <![CDATA[ |
| Prob[k-1 < \log_{1-p}(u) < k] \,\,=\,\, p(1-p)^{k-1} \,\,=\,\, |
| Prob[\textrm{skip $k-1$ rows}] %]]></script> |
| |
| <p>However, it requires us to estimate the maximum sample size, which we |
| set near $ks + 10\sqrt{ks}$ to make it generous enough.</p> |
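<p>The skip-count trick can be sketched on its own (an illustrative Python version; the function name is invented):</p>

```python
import math
import random

def sample_indices(n, p, rng):
    """Sample each of n rows independently with probability p by jumping
    over geometric gaps instead of drawing one uniform number per row."""
    indices, i = [], -1
    while True:
        u = rng.random() or 0.5   # guard the (measure-zero) case u == 0
        i += math.ceil(math.log(u) / math.log(1.0 - p))  # rows to skip until the next pick
        if i >= n:
            return indices
        indices.append(i)
```

<p>On average this draws about $np$ uniform numbers instead of $n$, which is the point of the trick.</p>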
| |
<p>Once we have selected the initial centroid sets, we start the $k$-means
iterations independently in parallel for all clustering runs. The number
of clustering runs is given by the <code>runs</code> input parameter.
Each iteration of each clustering run performs the following steps:</p>
| |
| <ul> |
<li>Compute the centroid-dependent part of the squared Euclidean distances from
all records (rows of $X$) to each of the $k$ centroids using a matrix
product.</li>
| <li>Take the minimum of the above for each record.</li> |
| <li>Update the current within-cluster sum of squares (WCSS) value, with |
| centroids substituted instead of the means for efficiency.</li> |
| <li>Check the convergence |
| criterion: |
| <script type="math/tex">% <![CDATA[ |
| \textrm{WCSS}_{\mathrm{old}} - \textrm{WCSS}_{\mathrm{new}} < {\varepsilon}\cdot\textrm{WCSS}_{\mathrm{new}} %]]></script> |
| as |
| well as the number of iterations limit.</li> |
| <li>Find the closest centroid for each record, sharing equally any records |
| with multiple closest centroids.</li> |
| <li>Compute the number of records closest to each centroid, checking for |
| “runaway” centroids with no records left (in which case the run fails).</li> |
| <li>Compute the new centroids by averaging the records in their clusters.</li> |
| </ul> |
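<p>The distance, assignment, and update steps can be sketched with matrix operations (a NumPy sketch of the idea, not the DML script; ties go to the lowest-index centroid here rather than being shared, and all names are invented):</p>

```python
import numpy as np

def kmeans_iteration(X, C):
    """One k-means iteration via matrix products. Since
    ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2 and the ||x||^2 term does not
    depend on the centroid, only the centroid-dependent part is needed
    to find each record's closest centroid."""
    D = -2.0 * (X @ C.T) + (C ** 2).sum(axis=1)      # n-by-k, omits the ||x||^2 term
    labels = D.argmin(axis=1)
    counts = np.bincount(labels, minlength=C.shape[0])
    if (counts == 0).any():                          # "runaway" centroid: the run fails
        raise RuntimeError("a centroid has no records left")
    # New centroids: per-cluster means via an n-by-k 0/1 assignment matrix.
    P = np.zeros((X.shape[0], C.shape[0]))
    P[np.arange(X.shape[0]), labels] = 1.0
    new_C = (P.T @ X) / counts[:, None]
    # Centroid-based WCSS_C for the convergence check (add back ||x||^2).
    wcss_c = D[np.arange(X.shape[0]), labels].sum() + (X ** 2).sum()
    return labels, new_C, wcss_c
```

<p>A driver would repeat this and stop once <code>wcss_old - wcss_new &lt; tol * wcss_new</code> or the iteration limit is reached, mirroring the criterion above.</p>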
| |
| <p>When a termination condition is satisfied, we store the centroids and |
| the WCSS value and exit this run. A run has to satisfy the WCSS |
| convergence criterion to be considered successful. Upon the termination |
| of all runs, we select the smallest WCSS value among the successful |
| runs, and write out this run’s centroids. If requested, we also compute |
| the cluster assignment of all records in $X$, using integers from 1 |
| to $k$ as the cluster labels. The scoring script can then be used to |
| compare the cluster assignment with an externally specified category |
| assignment.</p> |
| |
| <h3 id="returns">Returns</h3> |
| |
<p>We output the $k$ centroids for the best available clustering,
i.e. the one whose WCSS is the smallest of all successful runs. The centroids
| are written as the rows of the $k\,{\times}\,m$-matrix into the output |
| file whose path/name was provided as the <code>C</code> input |
| argument. If the input parameter <code>isY</code> was set |
| to <code>1</code>, we also output the one-column matrix with the cluster |
| assignment for all the records. This assignment is written into the file |
| whose path/name was provided as the <code>Y</code> input argument. The |
| best WCSS value, as well as some information about the performance of |
| the other runs, is printed during the script execution. The scoring |
| script <code>Kmeans-predict.dml</code> prints all its results in a |
| self-explanatory manner, as defined in |
| <a href="algorithms-clustering.html#table6"><strong>Table 6</strong></a>.</p> |
| |
| |
| |
| </div> <!-- /container --> |
| |
| |
| |
| <script src="js/vendor/jquery-1.12.0.min.js"></script> |
| <script src="js/vendor/bootstrap.min.js"></script> |
| <script src="js/vendor/anchor.min.js"></script> |
| <script src="js/main.js"></script> |
| |
| |
| |
| |
| |
| <!-- Analytics --> |
| <script> |
| (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ |
| (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), |
| m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) |
| })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); |
| ga('create', 'UA-71553733-1', 'auto'); |
| ga('send', 'pageview'); |
| </script> |
| |
| |
| |
| <!-- MathJax Section --> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ |
| TeX: { equationNumbers: { autoNumber: "AMS" } } |
| }); |
| </script> |
| <script> |
| // Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS. |
| // We could use "//cdn.mathjax...", but that won't support "file://". |
| (function(d, script) { |
| script = d.createElement('script'); |
| script.type = 'text/javascript'; |
| script.async = true; |
| script.onload = function(){ |
| MathJax.Hub.Config({ |
| tex2jax: { |
| inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ], |
| displayMath: [ ["$$","$$"], ["\\[", "\\]"] ], |
| processEscapes: true, |
| skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] |
| } |
| }); |
| }; |
| script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') + |
| 'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; |
| d.getElementsByTagName('head')[0].appendChild(script); |
| }(document)); |
| </script> |
| </body> |
| </html> |