| <!DOCTYPE html> |
| <!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]--> |
| <!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]--> |
| <!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]--> |
| <!--[if gt IE 8]><!--> |
| <html class="no-js"> |
| <!--<![endif]--> |
| <!-- |
| |
| --> |
| |
| <head> |
| <title>Algorithms Reference Classification - SystemDS 2.2.0</title> |
| <meta charset="utf-8"> |
| <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> |
| <meta name="description" content=""> |
| <meta name="viewport" content="width=device-width"> |
| <link rel="stylesheet" href="./../css/bootstrap.min.css"> |
| <link rel="stylesheet" href="./../css/main.css"> |
| <link rel="stylesheet" href="./../css/pygments-default.css"> |
| <link rel="shortcut icon" href="./../img/favicon.png"> |
| <script src="./../js/vendor/jquery-1.12.0.min.js"></script> |
| <script src="./../js/vendor/bootstrap.min.js"></script> |
| <script src="./../js/vendor/anchor.min.js"></script> |
| <script src="./../js/main.js"></script> |
| </head> |
| |
| <body> |
| <!-- |
| |
| --> |
| <header class="navbar navbar-default navbar-fixed-top" id="topbar"> |
| <div class="container"> |
| <div class="navbar-header"> |
| <div class="navbar-brand brand projectlogo"> |
| <a href="https://systemds.apache.org/"><img class="logo" src="./../img/systemds-logo.png" alt="Apache SystemDS" title="Apache SystemDS" /></a> |
| </div> |
| <div class="navbar-brand brand projecttitle"> |
| <a href="https://systemds.apache.org/">Apache SystemDS<sup id="trademark">™</sup></a><br /> |
| <span class="version">2.2.0</span> |
| </div> |
| <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse"> |
| <span class="sr-only">Toggle navigation</span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| <span class="icon-bar"></span> |
| </button> |
| </div> |
| <nav class="navbar-collapse collapse"> |
| <ul class="nav navbar-nav navbar-right"> |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">Overview<b class="caret"></b></a> |
| <ul class="dropdown-menu" role="menu"> |
| <li><b>Home:</b></li> |
| <li><a href="./../">Docs Home</a></li> |
| <li class="divider"></li> |
| <li><b>Running SystemDS:</b></li> |
| <li><a href="./../site/run">Standalone Guide</a></li> |
| <li><a href="./../site/gpu">GPU Guide</a></li> |
| <li><a href="./../site/native-backend">Native Backend (BLAS)</a></li> |
| <li class="divider"></li> |
| <li><b>Language Guides:</b></li> |
| <li><a href="./../site/dml-language-reference.html">DML Language Reference</a></li> |
| <li><a href="./../site/builtins-reference.html">Built-in Functions Reference</a></li> |
| <li><a href="./../site/dml-vs-r-guide.html">DML vs R guide</a></li> |
| <li class="divider"></li> |
| <li><b>Algorithms:</b></li> |
| <li><a href="./../site/algorithms-reference.html">ML Algorithms Reference</a></li> |
| <li class="divider"></li> |
| <li><b>Other:</b></li> |
| <li><a href="https://github.com/apache/systemds/blob/master/CONTRIBUTING.md">Contributing to SystemDS 🡕</a></li> |
| </ul> |
| </li> |
| <li><a href="https://github.com/apache/systemds">GitHub 🡕</a></li> |
| <li class="dropdown"> |
| <a href="#" class="dropdown-toggle" data-toggle="dropdown">API<b class="caret"></b></a> |
| <ul class="dropdown-menu" role="menu"> |
| <li><a href="./../api/java/">Java</a></li> |
| <li><a href="./../api/python/">Python</a></li> |
| </ul> |
| </li> |
| <li><a href="https://issues.apache.org/jira/secure/Dashboard.jspa?selectPageId=12335852">Issues</a></li> |
| </ul> |
| </nav> |
| </div> |
| </header> |
| |
| <div class="container" id="content"> |
| <h1 class="title">Algorithms Reference Classification</h1> |
| <!-- |
| |
| --> |
| |
| <h2 id="multinomial-logistic-regression">Multinomial Logistic Regression</h2> |
| |
| <h3 id="description">Description</h3> |
| |
| <p>The <code class="language-plaintext highlighter-rouge">MultiLogReg.dml</code> script performs both binomial and multinomial |
| logistic regression. The script is given a dataset $(X, Y)$ where matrix |
| $X$ has $m$ columns and matrix $Y$ has one column; both $X$ and $Y$ have |
| $n$ rows. The rows of $X$ and $Y$ are viewed as a collection of records: |
| $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a numerical vector of |
| explanatory (feature) variables and $y_i$ is a categorical response |
| variable. Each row $x_i$ in $X$ has size $\dim x_i = m$, while its corresponding $y_i$ |
| is an integer that represents the observed response value for |
| record $i$.</p> |
| |
| <p>The goal of logistic regression is to learn a linear model over the |
| feature vector $x_i$ that can be used to predict how likely each |
| categorical label is expected to be observed as the actual $y_i$. Note |
| that logistic regression predicts more than a label: it predicts the |
| probability for every possible label. The binomial case allows only two |
possible labels; the multinomial case has no such restriction.</p>
| |
| <p>Just as linear regression estimates the mean value $\mu_i$ of a |
| numerical response variable, logistic regression does the same for |
| category label probabilities. In linear regression, the mean of $y_i$ is |
| estimated as a linear combination of the features: |
| $\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$. |
| In logistic regression, the label probability has to lie between 0 |
| and 1, so a link function is applied to connect it to |
| $\beta_0 + x_i\beta_{1:m}$. If there are just two possible category |
| labels, for example 0 and 1, the logistic link looks as follows:</p> |
| |
| \[Prob[y_i\,{=}\,1\mid x_i; \beta] \,=\, |
| \frac{e^{\,\beta_0 + x_i\beta_{1:m}}}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}; |
| \quad |
| Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\, |
| \frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}\] |
| |
| <p>Here category label 0 |
| serves as the <em>baseline</em>, and function $\exp(\beta_0 + x_i\beta_{1:m})$ |
| shows how likely we expect to see “$y_i = 1$” in comparison to the |
baseline. As with a loaded coin, the predicted odds of seeing 1 versus 0
are $\exp(\beta_0 + x_i\beta_{1:m})$ to 1, with each feature $x_{i,j}$
contributing its own factor $\exp(\beta_j x_{i,j})$ to the odds. Given a
| large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic |
| regression seeks to find the $\beta_j$’s that maximize the product of |
| probabilities $Prob[y_i\mid x_i; \beta]$ |
| for actually observed $y_i$-labels (assuming no |
| regularization).</p> |
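
<p>For illustration, the binomial link above can be evaluated for all records
at once. The following is a minimal DML sketch with hypothetical variable
names, assuming the intercept has already been folded into
<code class="language-plaintext highlighter-rouge">X</code> as a trailing column of ones:</p>

<div class="language-dml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch: evaluate the binomial logistic link for all n records.
# Assumes X (n x (m+1)) carries a trailing column of ones for the intercept,
# with beta ((m+1) x 1) stacking (beta_1, ..., beta_m, beta_0) to match.
linear = X %*% beta                        # beta_0 + x_i beta_{1:m} per record
prob1  = exp(linear) / (1 + exp(linear))   # Prob[y_i = 1 | x_i; beta]
prob0  = 1 - prob1                         # Prob[y_i = 0 | x_i; beta]
</code></pre></div></div>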
| |
| <p>Multinomial logistic regression <a href="algorithms-bibliography.html">[Agresti2002]</a> |
| extends this link to |
| $k \geq 3$ possible categories. Again we identify one category as the |
| baseline, for example the $k$-th category. Instead of a coin, here we |
| have a loaded multisided die, one side per category. Each non-baseline |
| category $l = 1\ldots k\,{-}\,1$ has its own vector |
| $(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$ of regression |
| parameters with the intercept, making up a matrix $B$ of size |
| $(m\,{+}\,1)\times(k\,{-}\,1)$. The predicted odds of seeing |
| non-baseline category $l$ versus the baseline $k$ are |
| $\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$ |
| to 1, and the predicted probabilities are:</p> |
| |
| \[\begin{equation} |
| l < k: Prob [y_i {=} l \mid x_i; B] \,\,\,{=}\,\,\, |
| \frac{\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)}{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}; |
| \end{equation}\] |
| |
| \[\begin{equation} |
| Prob [y_i {=} k \mid x_i; B] \,\,\,{=}\,\,\, |
| \frac{1}{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}. |
| \end{equation}\] |
| |
| <p>The goal of the regression |
| is to estimate the parameter matrix $B$ from the provided dataset |
| $(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of $Prob[y_i\mid x_i; B]$ over the |
| observed labels $y_i$. Taking its logarithm, negating, and adding a |
| regularization term gives us a minimization objective:</p> |
| |
| \[\begin{equation} |
| f(B; X, Y) \,\,=\,\, |
| -\sum_{i=1}^n \,\log Prob[y_i\mid x_i; B] \,+\, |
| \frac{\lambda}{2} \sum_{j=1}^m \sum_{l=1}^{k-1} |\beta_{j,l}|^2 |
| \,\,\to\,\,\min |
| \end{equation}\] |
| |
| <p>The optional regularization term is added to |
| mitigate overfitting and degeneracy in the data; to reduce bias, the |
| intercepts $\beta_{0,l}$ are not regularized. Once the $\beta_{j,l}$’s |
| are accurately estimated, we can make predictions about the category |
| label $y$ for a new feature vector $x$ using |
| Eqs. (1) and (2).</p> |
| |
| <hr /> |
| |
| <h4 id="table-5">Table 5</h4> |
| |
<p>The <code class="language-plaintext highlighter-rouge">Log</code> file for multinomial logistic regression
contains the following iteration variables in <code class="language-plaintext highlighter-rouge">CSV</code> format, each line
containing a triple (<code class="language-plaintext highlighter-rouge">Name</code>, <code class="language-plaintext highlighter-rouge">Iteration#</code>, <code class="language-plaintext highlighter-rouge">Value</code>) with <code class="language-plaintext highlighter-rouge">Iteration#</code> being 0
| for initial values.</p> |
| |
| <table> |
| <thead> |
| <tr> |
| <th>Name</th> |
| <th>Meaning</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>LINEAR_TERM_MIN</td> |
| <td>The minimum value of $X$ %*% $B$, used to check for overflows</td> |
| </tr> |
| <tr> |
| <td>LINEAR_TERM_MAX</td> |
| <td>The maximum value of $X$ %*% $B$, used to check for overflows</td> |
| </tr> |
| <tr> |
| <td>NUM_CG_ITERS</td> |
| <td>Number of inner (Conj. Gradient) iterations in this outer iteration</td> |
| </tr> |
| <tr> |
| <td>IS_TRUST_REACHED</td> |
| <td>$1 = {}$trust region boundary was reached, $0 = {}$otherwise</td> |
| </tr> |
| <tr> |
| <td>POINT_STEP_NORM</td> |
| <td>L2-norm of iteration step from old point (matrix $B$) to new point</td> |
| </tr> |
| <tr> |
| <td>OBJECTIVE</td> |
| <td>The loss function we minimize (negative regularized log-likelihood)</td> |
| </tr> |
| <tr> |
| <td>OBJ_DROP_REAL</td> |
| <td>Reduction in the objective during this iteration, actual value</td> |
| </tr> |
| <tr> |
| <td>OBJ_DROP_PRED</td> |
| <td>Reduction in the objective predicted by a quadratic approximation</td> |
| </tr> |
| <tr> |
| <td>OBJ_DROP_RATIO</td> |
| <td>Actual-to-predicted reduction ratio, used to update the trust region</td> |
| </tr> |
| <tr> |
| <td>IS_POINT_UPDATED</td> |
| <td>$1 = {}$new point accepted; $0 = {}$new point rejected, old point restored</td> |
| </tr> |
| <tr> |
| <td>GRADIENT_NORM</td> |
| <td>L2-norm of the loss function gradient (omitted if point is rejected)</td> |
| </tr> |
| <tr> |
<td>TRUST_DELTA</td>
| <td>Updated trust region size, the “delta”</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <hr /> |
| |
| <h3 id="details">Details</h3> |
| |
| <p>We estimate the logistic regression parameters via L2-regularized |
| negative log-likelihood minimization (3). The |
| optimization method used in the script closely follows the trust region |
| Newton method for logistic regression described in <a href="algorithms-bibliography.html">[Lin2008]</a>. |
For convenience, let us make some changes in notation (a DML sketch of these conversions follows the list below):</p>
| |
| <ul> |
| <li>Convert the input vector of observed category labels into an indicator |
| matrix $Y$ of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th |
| category label is $l$ and $Y_{i, l} = 0$ otherwise.</li> |
| <li>Append an extra column of all ones, i.e. $(1, 1, \ldots, 1)^T$, as the |
| $m\,{+}\,1$-st column to the feature matrix $X$ to represent the |
| intercept.</li> |
| <li>Append an all-zero column as the $k$-th column to $B$, the matrix of |
| regression parameters, to represent the baseline category.</li> |
| <li>Convert the regularization constant $\lambda$ into matrix $\Lambda$ of |
| the same size as $B$, placing 0’s into the $m\,{+}\,1$-st row to disable |
| intercept regularization, and placing $\lambda$’s everywhere else.</li> |
| </ul> |
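
<p>These conversions map directly onto DML primitives such as
<code class="language-plaintext highlighter-rouge">table</code> and
<code class="language-plaintext highlighter-rouge">cbind</code>. A minimal, hedged
sketch with hypothetical variable names, where
<code class="language-plaintext highlighter-rouge">y</code> is the $n\times 1$ label
vector and <code class="language-plaintext highlighter-rouge">lambda</code> and $k$
are given:</p>

<div class="language-dml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch of the notation changes above (hypothetical variable names).
n = nrow(X)
m = ncol(X)
Y = table(seq(1, n), y, n, k)               # n x k indicator matrix of labels
X = cbind(X, matrix(1, rows=n, cols=1))     # intercept: (m+1)-st column of ones
B = cbind(B, matrix(0, rows=m+1, cols=1))   # all-zero k-th column: baseline
Lambda = matrix(lambda, rows=m+1, cols=k)   # regularization constants ...
Lambda[m+1,] = matrix(0, rows=1, cols=k)    # ... zeroed in the intercept row
</code></pre></div></div>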
| |
| <p>Now the ($n\,{\times}\,k$)-matrix of predicted probabilities given by |
| (1) and (2) and the |
| objective function $f$ in (3) have the matrix form</p> |
| |
| \[\begin{aligned} |
| P \,\,&=\,\, \exp(XB) \,\,/\,\, \big(\exp(XB)\,1_{k\times k}\big)\\ |
| f \,\,&=\,\, - \,\,{\textstyle\sum} \,\,Y \cdot (X B)\, + \, |
| {\textstyle\sum}\,\log\big(\exp(XB)\,1_{k\times 1}\big) \,+ \, |
| (1/2)\,\, {\textstyle\sum} \,\,\Lambda \cdot B \cdot B\end{aligned}\] |
| |
| <p>where operations $\cdot\,$, <code class="language-plaintext highlighter-rouge">/</code>, <code class="language-plaintext highlighter-rouge">exp</code>, and <code class="language-plaintext highlighter-rouge">log</code> are applied |
| cellwise, and $\textstyle\sum$ denotes the sum of all cells in a matrix. |
| The gradient of $f$ with respect to $B$ can be represented as a matrix |
| too:</p> |
| |
| \[\nabla f \,\,=\,\, X^T (P - Y) \,+\, \Lambda \cdot B\] |
| |
| <p>The Hessian $\mathcal{H}$ of $f$ is a tensor, but, fortunately, the |
| conjugate gradient inner loop of the trust region algorithm |
| in <a href="algorithms-bibliography.html">[Lin2008]</a> |
| does not need to instantiate it. We only need to |
| multiply $\mathcal{H}$ by ordinary matrices of the same size as $B$ and |
| $\nabla f$, and this can be done in matrix form:</p> |
| |
| \[\mathcal{H}V \,\,=\,\, X^T \big( Q \,-\, P \cdot (Q\,1_{k\times k}) \big) \,+\, |
| \Lambda \cdot V, \,\,\,\,\textrm{where}\,\,\,\,Q \,=\, P \cdot (XV)\] |
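
<p>These matrix formulas translate almost verbatim into DML. A minimal,
hedged sketch (hypothetical names, reusing the converted $X$, $Y$, $B$, and
$\Lambda$ from above) computes the probabilities, the objective, the
gradient, and the Hessian-vector product:</p>

<div class="language-dml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch: matrix-form probabilities, objective, gradient, and
# Hessian-vector product, mirroring the formulas above.
E = exp(X %*% B)                       # n x k exponentiated linear terms
P = E / rowSums(E)                     # predicted probabilities; rows sum to 1
f = -sum(Y * (X %*% B)) + sum(log(rowSums(E))) + 0.5 * sum(Lambda * B * B)
Grad = t(X) %*% (P - Y) + Lambda * B   # gradient of f with respect to B
# multiply the Hessian by an arbitrary matrix V of the same shape as B:
Q  = P * (X %*% V)
HV = t(X) %*% (Q - P * rowSums(Q)) + Lambda * V
</code></pre></div></div>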
| |
| <p>At each Newton iteration (the <em>outer</em> iteration) the minimization algorithm |
| approximates the difference |
| $\varDelta f(S; B) = f(B + S; X, Y) \,-\, f(B; X, Y)$ attained in the |
| objective function after a step $B \mapsto B\,{+}\,S$ by a second-degree |
| formula</p> |
| |
| \[\varDelta f(S; B) \,\,\,\approx\,\,\, (1/2)\,\,{\textstyle\sum}\,\,S \cdot \mathcal{H}S |
| \,+\, {\textstyle\sum}\,\,S\cdot \nabla f\] |
| |
| <p>This approximation is then |
| minimized by trust-region conjugate gradient iterations (the <em>inner</em> |
| iterations) subject to the constraint</p> |
| |
| \[\|S\|_2 \leq \delta\] |
| |
| <p>The trust |
region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max_i \|x_i\|_2$
| and updated as described in <a href="algorithms-bibliography.html">[Lin2008]</a>. |
| Users can specify the maximum number of the outer |
| and the inner iterations with input parameters <code class="language-plaintext highlighter-rouge">moi</code> and |
| <code class="language-plaintext highlighter-rouge">mii</code>, respectively. The iterative minimizer terminates |
| successfully if</p> |
| |
| \[\|\nabla f\|_2 < \varepsilon \|\nabla f_{B=0} \|_2\] |
| |
<p>where $\varepsilon > 0$ is a tolerance supplied by the user via input parameter <code class="language-plaintext highlighter-rouge">tol</code>.</p>
| |
| <h3 id="returns">Returns</h3> |
| |
<p>The estimated regression parameters (the $\hat{\beta}_{j, l}$’s) are
populated into a matrix and written to an HDFS file whose path/name was
provided as the <code class="language-plaintext highlighter-rouge">B</code> input argument. Only the non-baseline
categories ($1\leq l \leq k\,{-}\,1$) have their $\hat{\beta}_{j, l}$’s
in the output; to add the baseline category, just append a column of zeros.
If <code class="language-plaintext highlighter-rouge">icpt=0</code> in the input command line, no intercepts are used
and <code class="language-plaintext highlighter-rouge">B</code> has size
$m\times (k\,{-}\,1)$; otherwise
<code class="language-plaintext highlighter-rouge">B</code> has size
$(m\,{+}\,1)\times (k\,{-}\,1)$
and the intercepts are in the
$m\,{+}\,1$-st row. If <code class="language-plaintext highlighter-rouge">icpt=2</code>, then
initially the feature columns in $X$ are shifted to mean${} = 0$ and
rescaled to variance${} = 1$. After the iterations converge, the
$\hat{\beta}_{j, l}$’s are rescaled and shifted to work with the
original features.</p>
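
<p>As a hedged illustration of this post-processing (hypothetical names,
assuming a model trained with intercepts), appending the baseline column and
scoring a new feature matrix could look like:</p>

<div class="language-dml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch: append the all-zero baseline column and score new data.
B_full = cbind(B, matrix(0, rows=nrow(B), cols=1))          # (m+1) x k
X_new1 = cbind(X_new, matrix(1, rows=nrow(X_new), cols=1))  # intercept column
E_new  = exp(X_new1 %*% B_full)
P_new  = E_new / rowSums(E_new)        # class probabilities per new record
</code></pre></div></div>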
| |
| <hr /> |
| |
| <h2 id="support-vector-machines">Support Vector Machines</h2> |
| |
| <h3 id="binary-class-support-vector-machines">Binary-Class Support Vector Machines</h3> |
| |
| <h4 id="binary-svm-description">Binary SVM Description</h4> |
| |
| <p>Support Vector Machines are used to model the relationship between a |
| categorical dependent variable <code class="language-plaintext highlighter-rouge">y</code> and one or more explanatory variables |
| denoted <code class="language-plaintext highlighter-rouge">X</code>. This implementation learns (and predicts with) a binary class |
| support vector machine (<code class="language-plaintext highlighter-rouge">y</code> with domain size <code class="language-plaintext highlighter-rouge">2</code>).</p> |
| |
| <h4 id="binary-svm-details">Binary SVM Details</h4> |
| |
| <p>Support vector machines learn a classification function by solving the |
| following optimization problem ($L_2$-SVM):</p> |
| |
| \[\begin{aligned} |
&\textrm{argmin}_w& \frac{\lambda}{2} \|w\|_2^2 + \sum_i \xi_i^2\\
| &\textrm{subject to:}& y_i w^{\top} x_i \geq 1 - \xi_i ~ \forall i\end{aligned}\] |
| |
| <p>where $x_i$ is an example from the training set with its label given by |
| $y_i$, $w$ is the vector of parameters and $\lambda$ is the |
| regularization constant specified by the user.</p> |
| |
| <p>To account for the missing bias term, one may augment the data with a |
| column of constants which is achieved by setting the intercept argument to <code class="language-plaintext highlighter-rouge">1</code> |
| <a href="algorithms-bibliography.html">[Hsieh2008]</a>.</p> |
| |
| <p>This implementation optimizes the primal directly |
| <a href="algorithms-bibliography.html">[Chapelle2007]</a>. It |
| uses nonlinear conjugate gradient descent to minimize the objective |
| function coupled with choosing step-sizes by performing one-dimensional |
| Newton minimization in the direction of the gradient.</p> |
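
<p>To make the objective concrete, the following minimal DML sketch
(hypothetical names) evaluates the primal $L_2$-SVM objective for a given
$w$, with labels $y_i \in \{-1, +1\}$ as in the formulation above:</p>

<div class="language-dml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch: primal L2-SVM objective with squared hinge loss.
# y is an n x 1 vector of -1/+1 labels; w has one row per column of X.
# With intercept set to 1, X would first gain a column of ones:
# X = cbind(X, matrix(1, rows=nrow(X), cols=1))
margin = y * (X %*% w)            # y_i * w^T x_i for every record
xi     = max(1 - margin, 0)       # slack, active where the margin falls below 1
obj    = (lambda / 2) * sum(w^2) + sum(xi^2)
</code></pre></div></div>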
| |
| <h4 id="binary-svm-returns">Binary SVM Returns</h4> |
| |
| <p>The learnt weights produced by <code class="language-plaintext highlighter-rouge">l2-svm.dml</code> are populated into a single |
column matrix and written to a file on HDFS (see model in section
| Arguments). The number of rows in this matrix is <code class="language-plaintext highlighter-rouge">ncol(X)</code> if intercept |
| was set to <code class="language-plaintext highlighter-rouge">0</code> during invocation and <code class="language-plaintext highlighter-rouge">ncol(X) + 1</code> otherwise. The bias term, |
| if used, is placed in the last row. Depending on what arguments are |
| provided during invocation, <code class="language-plaintext highlighter-rouge">l2-svm-predict.dml</code> may compute one or more |
| of scores, accuracy and confusion matrix in the output format |
| specified.</p> |
| |
| <hr /> |
| |
| <h3 id="multi-class-support-vector-machines">Multi-Class Support Vector Machines</h3> |
| |
| <h4 id="multi-svm-description">Multi SVM Description</h4> |
| |
| <p>Support Vector Machines are used to model the relationship between a |
| categorical dependent variable <code class="language-plaintext highlighter-rouge">y</code> and one or more explanatory variables |
| denoted <code class="language-plaintext highlighter-rouge">X</code>. This implementation supports dependent variables that have |
domain size greater than or equal to <code class="language-plaintext highlighter-rouge">2</code> and hence is not restricted to binary
| class labels.</p> |
| |
| <h4 id="multi-svm-details">Multi SVM Details</h4> |
| |
| <p>Support vector machines learn a classification function by solving the |
| following optimization problem ($L_2$-SVM):</p> |
| |
| \[\begin{aligned} |
&\textrm{argmin}_w& \frac{\lambda}{2} \|w\|_2^2 + \sum_i \xi_i^2\\
| &\textrm{subject to:}& y_i w^{\top} x_i \geq 1 - \xi_i ~ \forall i\end{aligned}\] |
| |
| <p>where $x_i$ is an example from the training set with its label given by |
| $y_i$, $w$ is the vector of parameters and $\lambda$ is the |
| regularization constant specified by the user.</p> |
| |
| <p>To extend the above formulation (binary class SVM) to the multiclass |
| setting, one standard approach is to learn one binary class SVM per |
| class that separates data belonging to that class from the rest of the |
| training data (one-against-the-rest SVM, see |
| <a href="algorithms-bibliography.html">[Scholkopf1995]</a>).</p> |
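
<p>A hedged DML sketch of this reduction (hypothetical names;
<code class="language-plaintext highlighter-rouge">trainBinarySVM</code> stands in
for a binary $L_2$-SVM solver and is not an actual SystemDS function):</p>

<div class="language-dml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch: one-against-the-rest reduction to k binary problems.
# y holds class labels in 1..k; W collects one weight column per class.
W = matrix(0, rows=ncol(X), cols=k)
parfor (l in 1:k) {
  y_bin = 2 * (y == l) - 1                    # +1 for class l, -1 for the rest
  W[, l] = trainBinarySVM(X, y_bin, lambda)   # hypothetical binary solver
}
</code></pre></div></div>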
| |
| <p>To account for the missing bias term, one may augment the data with a |
| column of constants which is achieved by setting intercept argument to 1 |
| <a href="algorithms-bibliography.html">[Hsieh2008]</a>.</p> |
| |
| <p>This implementation optimizes the primal directly |
| <a href="algorithms-bibliography.html">[Chapelle2007]</a>. It |
| uses nonlinear conjugate gradient descent to minimize the objective |
| function coupled with choosing step-sizes by performing one-dimensional |
| Newton minimization in the direction of the gradient.</p> |
| |
| <h4 id="multi-svm-returns">Multi SVM Returns</h4> |
| |
| <p>The learnt weights produced by <code class="language-plaintext highlighter-rouge">m-svm.dml</code> are populated into a matrix |
| that has as many columns as there are classes in the training data, and |
written to a file on HDFS (see model in section Arguments). The
| number of rows in this matrix is <code class="language-plaintext highlighter-rouge">ncol(X)</code> if intercept was set to <code class="language-plaintext highlighter-rouge">0</code> |
| during invocation and <code class="language-plaintext highlighter-rouge">ncol(X) + 1</code> otherwise. The bias terms, if used, |
| are placed in the last row. Depending on what arguments are provided |
| during invocation, <code class="language-plaintext highlighter-rouge">m-svm-predict.dml</code> may compute one or more of scores, |
| accuracy and confusion matrix in the output format specified.</p> |
| |
| <hr /> |
| |
| <h2 id="naive-bayes">Naive Bayes</h2> |
| |
| <h3 id="naive-bayes-description">Naive Bayes Description</h3> |
| |
<p>Naive Bayes is a very simple generative model used for classifying data.
| This implementation learns a multinomial naive Bayes classifier which is |
| applicable when all features are counts of categorical values.</p> |
| |
| <h3 id="naive-bayes-details">Naive Bayes Details</h3> |
| |
| <p>Naive Bayes is a very simple generative classification model. It posits |
| that given the class label, features can be generated independently of |
| each other. More precisely, the (multinomial) naive Bayes model uses the |
| following equation to estimate the joint probability of a feature vector |
| $x$ belonging to class $y$:</p> |
| |
| \[\text{Prob}(y, x) = \pi_y \prod_{i \in x} \theta_{iy}^{n(i,x)}\] |
| |
| <p>where $\pi_y$ denotes the prior probability of class $y$, $i$ denotes a |
| feature present in $x$ with $n(i,x)$ denoting its count and |
| $\theta_{iy}$ denotes the class conditional probability of feature $i$ |
| in class $y$. The usual constraints hold on $\pi$ and $\theta$:</p> |
| |
| \[\begin{aligned} |
| && \pi_y \geq 0, ~ \sum_{y \in \mathcal{C}} \pi_y = 1\\ |
| \forall y \in \mathcal{C}: && \theta_{iy} \geq 0, ~ \sum_i \theta_{iy} = 1\end{aligned}\] |
| |
| <p>where $\mathcal{C}$ is the set of classes.</p> |
| |
| <p>Given a fully labeled training dataset, it is possible to learn a naive |
| Bayes model using simple counting (group-by aggregates). To compute the |
| class conditional probabilities, it is usually advisable to avoid |
| setting $\theta_{iy}$ to <code class="language-plaintext highlighter-rouge">0</code>. One way to achieve this is using additive |
| smoothing or Laplace smoothing. Some authors have argued that this |
| should in fact be add-one smoothing. This implementation uses add-one |
smoothing by default but lets the user specify their own constant, if
| required.</p> |
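
<p>The counting itself reduces to a few group-by aggregates. A minimal DML
sketch with hypothetical names, where
<code class="language-plaintext highlighter-rouge">y</code> holds class labels in
$1 \ldots k$, <code class="language-plaintext highlighter-rouge">X</code> holds
nonnegative feature counts, and
<code class="language-plaintext highlighter-rouge">laplace</code> is the smoothing
constant:</p>

<div class="language-dml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch: multinomial naive Bayes training via aggregates.
n  = nrow(X)
Yi = table(seq(1, n), y, n, k)            # n x k class-indicator matrix
classPrior    = t(colSums(Yi)) / n        # k x 1 estimate of pi_y
featureCounts = t(Yi) %*% X               # k x m per-class feature counts
denom     = rowSums(featureCounts) + laplace * ncol(X)
classCond = (featureCounts + laplace) / denom   # smoothed estimate of theta_iy
</code></pre></div></div>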
| |
| <p>This implementation is sometimes referred to as <em>multinomial</em> naive |
| Bayes. Other flavours of naive Bayes are also popular.</p> |
| |
| <h3 id="naive-bayes-returns">Naive Bayes Returns</h3> |
| |
| <p>The learnt model produced by <code class="language-plaintext highlighter-rouge">naive-bayes.dml</code> is stored in two separate |
| files. The first file stores the class prior (a single-column matrix). |
| The second file stores the class conditional probabilities organized |
| into a matrix with as many rows as there are class labels and as many |
| columns as there are features. Depending on what arguments are provided |
| during invocation, <code class="language-plaintext highlighter-rouge">naive-bayes-predict.dml</code> may compute one or more of |
| probabilities, accuracy and confusion matrix in the output format |
| specified.</p> |
| |
| <hr /> |
| |
| <h2 id="decision-trees">Decision Trees</h2> |
| |
| <h3 id="decision-trees-description">Decision Trees Description</h3> |
| |
<p>A decision tree (for classification) is a classifier that is considered
| more interpretable than other statistical classifiers. This |
| implementation is well-suited to handle large-scale data and builds a |
| (binary) decision tree in parallel.</p> |
| |
| <h3 id="decision-trees-details">Decision Trees Details</h3> |
| |
| <p>Decision trees <a href="algorithms-bibliography.html">[BreimanFOS1984]</a> are simple models of classification |
| that, due to their structure, are easy to interpret. Given an example |
| feature vector, each node in the learned tree runs a simple test on it. |
| Based on the result of the test, the example is either diverted to the |
| left subtree or to the right subtree. Once the example reaches a leaf, |
| then the label stored at the leaf is returned as the prediction for the |
| example.</p> |
| |
| <p>Building a decision tree from a fully labeled training set entails |
| choosing appropriate splitting tests for each internal node in the tree |
| and this is usually performed in a top-down manner. The splitting test |
| (denoted by $s$) requires first choosing a feature $j$ and depending on |
| the type of $j$, either a threshold $\sigma$, in case $j$ is |
| continuous-valued, or a subset of values $S \subseteq \text{Dom}(j)$ |
where $\text{Dom}(j)$ denotes the domain of $j$, in case it is categorical.
For continuous-valued features the test is thus of the form $x_j < \sigma$
and for categorical features it is of the form $x_j \in S$, where $x_j$
denotes the $j$th feature value of feature vector $x$. One way to
determine which test to include is to compare the impurities of the tree
| nodes induced by the test. The <em>node impurity</em> measures the |
| homogeneity of the labels at the node. This implementation supports two |
| commonly used impurity measures (denoted by $\mathcal{I}$): |
| <em>Entropy</em> |
| $\mathcal{E}=\sum_{i=1}^{C}-f_i \log f_i$, as |
| well as <em>Gini impurity</em> |
| $\mathcal{G}=\sum_{i=1}^{C}f_i (1-f_i)$, where $C$ denotes the number of |
| unique labels and $f_i$ is the frequency of label $i$. Once the impurity |
| at the tree nodes has been obtained, the <em>best split</em> is |
chosen from a set of possible splits as the one that maximizes the
| <em>information gain</em> at the node, i.e., |
| $\arg\max_{s}\mathcal{IG}(X,s)$, where $\mathcal{IG}(X,s)$ denotes the |
| information gain when the splitting test $s$ partitions the feature |
matrix $X$. Assuming that $s$ partitions $X$, which contains $N$ feature
vectors, into $X_\text{left}$ and $X_\text{right}$ containing
$N_\text{left}$ and $N_\text{right}$ feature vectors, respectively,
| $\mathcal{IG}(X,s)$ is given by</p> |
| |
| \[\mathcal{IG}(X,s)=\mathcal{I}(X)-\frac{N_\text{left}}{N}\mathcal{I}(X_\text{left})-\frac{N_\text{right}}{N}\mathcal{I}(X_\text{right})\] |
| |
<p>where $\mathcal{I}\in\{\mathcal{E},\mathcal{G}\}$. In the following we
| discuss the implementation details specific to |
| <code class="language-plaintext highlighter-rouge">decision-tree.dml</code>.</p> |
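
<p>As a small illustration, the two impurity measures and the information
gain can be written in DML as follows (a hedged sketch with hypothetical
names; <code class="language-plaintext highlighter-rouge">f</code> is a $C \times 1$
vector of label frequencies at a node):</p>

<div class="language-dml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch: impurity measures and information gain of one split.
entropy = function(matrix[double] f) return (double E) {
  E = -sum(f * log(f + (f == 0)))    # (f == 0) guard treats 0*log(0) as 0
}
gini = function(matrix[double] f) return (double G) {
  G = sum(f * (1 - f))
}
# gain of a split of N records into N_left and N_right, given impurities:
IG = I_X - (N_left / N) * I_left - (N_right / N) * I_right
</code></pre></div></div>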
| |
<p><strong>Input format.</strong> In general, implementations of the decision tree
algorithm do not require categorical features to be dummy coded. For
improved efficiency and reduced training time, however, our implementation
assumes dummy-coded categorical features and dummy-coded class
labels.</p>
| |
| <p><strong>Tree construction.</strong> Learning a decision tree on large-scale data has |
| received some attention in the literature. The current implementation |
| includes logic for choosing tests for multiple nodes that belong to the |
| same level in the decision tree in parallel (breadth-first expansion) |
| and for building entire subtrees under multiple nodes in parallel |
| (depth-first subtree building). Empirically it has been demonstrated |
| that it is advantageous to perform breadth-first expansion for the nodes |
| belonging to the top levels of the tree and to perform depth-first |
| subtree building for nodes belonging to the lower levels of the |
| tree <a href="algorithms-bibliography.html">[PandaHBB2009]</a>. The parameter <code class="language-plaintext highlighter-rouge">num_samples</code> controls |
when we switch to depth-first subtree building: for any node in the decision
tree that receives $\leq$ <code class="language-plaintext highlighter-rouge">num_samples</code> training examples,
the subtree under it is built in its entirety in one shot.</p>
| |
| <p><strong>Stopping rule and pruning.</strong> The splitting of data at the internal |
nodes stops when at least one of the following criteria is satisfied:</p>
| |
| <ul> |
| <li>the depth of the internal node reaches the input parameter |
| <code class="language-plaintext highlighter-rouge">depth</code> controlling the maximum depth of the learned |
| tree, or</li> |
| <li>no candidate split achieves information gain.</li> |
| </ul> |
| |
| <p>This implementation also allows for some automated pruning via the |
| argument <code class="language-plaintext highlighter-rouge">num_leaf</code>. If a node receives $\leq$ |
| <code class="language-plaintext highlighter-rouge">num_leaf</code> training examples, then a leaf is built in its |
| place.</p> |
| |
| <p><strong>Continuous-valued features.</strong> For a continuous-valued feature $j$ the |
| number of candidate thresholds $\sigma$ to choose from is of the order |
| of the number of examples present in the training set. Since for |
| large-scale data this can result in a large number of candidate |
thresholds, the user can limit this number via the argument
<code class="language-plaintext highlighter-rouge">bins</code>, which controls the number of candidate thresholds
| considered for each continuous-valued feature. For each |
| continuous-valued feature, the implementation computes an equi-height |
| histogram to generate one candidate threshold per equi-height bin.</p> |
| |
| <p><strong>Categorical features.</strong> In order to determine the best value subset to |
| split on in the case of categorical features, this implementation |
| greedily includes values from the feature’s domain until the information |
| gain stops improving. In particular, for a categorical feature $j$ the |
$|Dom(j)|$ feature values are sorted by impurity and the resulting
$|Dom(j)|-1$ split candidates are examined; the sequence of feature values
| which results in the maximum information gain is then selected.</p> |
| |
| <p><strong>Description of the model.</strong> The learned decision tree is represented |
| in a matrix $M$ that contains at least 6 rows. Each column in the matrix |
| contains the parameters relevant to a single node in the tree. Note that |
| for building the tree model, our implementation splits the feature |
| matrix $X$ into $X_\text{cont}$ containing continuous-valued features |
| and $X_\text{cat}$ containing categorical features. In the following, |
| the continuous-valued (resp. categorical) feature-ids correspond to the |
| indices of the features in $X_\text{cont}$ (resp. $X_\text{cat}$). |
| Moreover, we refer to an internal node as a continuous-valued |
(categorical) node if the feature that this node looks at is
| continuous-valued (categorical). Below is a description of what each row |
| in the matrix contains.</p> |
| |
| <ul> |
| <li>Row 1: stores the node-ids. These ids correspond to the node-ids in |
| a complete binary tree.</li> |
| <li>Row 2: for internal nodes stores the offsets (the number of columns) |
| in $M$ to the left child, and otherwise <code class="language-plaintext highlighter-rouge">0</code>.</li> |
| <li>Row 3: stores the feature index of the feature (id of a |
| continuous-valued feature in $X_\text{cont}$ if the feature is |
| continuous-valued or id of a categorical feature in $X_\text{cat}$ |
| if the feature is categorical) that this node looks at if the node |
| is an internal node, otherwise <code class="language-plaintext highlighter-rouge">0</code>.</li> |
<li>Row 4: stores the type of the feature that this node looks at if the
| node is an internal node: <code class="language-plaintext highlighter-rouge">1</code> for continuous-valued and <code class="language-plaintext highlighter-rouge">2</code> for |
| categorical features, otherwise the label this leaf node is supposed |
| to predict.</li> |
| <li>Row 5: for the internal nodes contains <code class="language-plaintext highlighter-rouge">1</code> if the feature chosen for |
| the node is continuous-valued, or the size of the subset of values |
| used for splitting at the node stored in rows 6,7,$\ldots$ if the |
| feature chosen for the node is categorical. For the leaf nodes, Row |
5 contains the number of misclassified training examples reaching
this node.</li>
| <li>Row 6,7,$\ldots$: for the internal nodes, row 6 stores the threshold |
| to which the example’s feature value is compared if the feature |
| chosen for this node is continuous-valued, otherwise if the feature |
| chosen for this node is categorical rows 6,7,$\ldots$ store the |
| value subset chosen for the node. For the leaf nodes, row 6 contains |
| <code class="language-plaintext highlighter-rouge">1</code> if the node is impure and the number of training examples at the |
| node is greater than <code class="language-plaintext highlighter-rouge">num_leaf</code>, otherwise <code class="language-plaintext highlighter-rouge">0</code>.</li> |
| </ul> |
| |
| <p>As an example, <a href="figure-2"><strong>Figure 2</strong></a> shows a decision tree with $5$ nodes and |
| its matrix representation.</p> |
| |
| <hr /> |
| |
| <h4 id="figure-2">Figure 2</h4> |
| |
| <p><strong>Figure 2</strong>: (a) An example tree and its (b) matrix representation. $x$ denotes an example and $x_j$ is the value of the $j$th continuous-valued (resp. categorical) feature in $X_\text{cont}$ (resp. $X_\text{cat}$). In this example all leaf nodes are pure and no training example is misclassified.</p> |
| |
| <p>(a) <img src="../img/algorithms-reference/example-tree.png" alt="Figure 2" title="Figure 2" /></p> |
| |
| <p>(b)</p> |
| |
| <table> |
| <thead> |
| <tr> |
| <th> </th> |
| <th>Col 1</th> |
| <th>Col 2</th> |
| <th>Col 3</th> |
| <th>Col 4</th> |
| <th>Col 5</th> |
| </tr> |
| </thead> |
| <tbody> |
| <tr> |
| <td>Row 1</td> |
| <td>1</td> |
| <td>2</td> |
| <td>3</td> |
| <td>6</td> |
| <td>7</td> |
| </tr> |
| <tr> |
| <td>Row 2</td> |
| <td>1</td> |
| <td>0</td> |
| <td>1</td> |
| <td>0</td> |
| <td>0</td> |
| </tr> |
| <tr> |
| <td>Row 3</td> |
| <td>3</td> |
| <td>0</td> |
| <td>5</td> |
| <td>0</td> |
| <td>0</td> |
| </tr> |
| <tr> |
| <td>Row 4</td> |
| <td>1</td> |
| <td>1</td> |
| <td>2</td> |
| <td>2</td> |
| <td>1</td> |
| </tr> |
| <tr> |
| <td>Row 5</td> |
| <td>1</td> |
| <td>0</td> |
| <td>2</td> |
| <td>0</td> |
| <td>0</td> |
| </tr> |
| <tr> |
| <td>Row 6</td> |
| <td>0.45</td> |
| <td>0</td> |
| <td>2</td> |
| <td>0</td> |
| <td>0</td> |
| </tr> |
| <tr> |
| <td>Row 7</td> |
| <td> </td> |
| <td> </td> |
| <td>3</td> |
| <td> </td> |
| <td> </td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <hr /> |
| |
| <h3 id="decision-trees-returns">Decision Trees Returns</h3> |
| |
| <p>The matrix corresponding to the learned model as well as the training |
| accuracy (if requested) is written to a file in the format specified. |
See the Details section above, where the structure of the model matrix is described. Recall
| that in our implementation $X$ is split into $X_\text{cont}$ and |
| $X_\text{cat}$. If requested, the mappings of the continuous-valued |
| feature-ids in $X_\text{cont}$ (stored at <code class="language-plaintext highlighter-rouge">S_map</code>) and the |
| categorical feature-ids in $X_\text{cat}$ (stored at |
| <code class="language-plaintext highlighter-rouge">C_map</code>) to the global feature-ids in $X$ will be provided. |
| Depending on what arguments are provided during invocation, the |
| <code class="language-plaintext highlighter-rouge">decision-tree-predict.dml</code> script may compute one or more of |
| predictions, accuracy and confusion matrix in the requested output |
| format.</p> |
| |
| <h2 id="random-forests">Random Forests</h2> |
| |
| <h3 id="random-forests-description">Random Forests Description</h3> |
| |
| <p>Random forest is one of the most successful machine learning methods for |
| classification and regression. It is an ensemble learning method that |
| creates a model composed of a set of tree models. This implementation is |
| well-suited to handle large-scale data and builds a random forest model |
| for classification in parallel.</p> |
| |
| <h3 id="random-forests-details">Random Forests Details</h3> |
| |
| <p>Random forests <a href="algorithms-bibliography.html">[Breiman2001]</a> |
| are learning algorithms for ensembles |
| of decision trees. The main idea is to build a number of decision trees |
on bootstrapped training samples, i.e., by repeatedly taking samples
from a (single) training set. Moreover, instead of considering all the
features when building the trees, only a random subset of the
features—typically $\approx \sqrt{D}$, where $D$ is the number of
features—is chosen each time a split test at a tree node is performed.
This procedure <em>decorrelates</em> the trees and makes the ensemble less
| prone to overfitting. To build decision trees we utilize the techniques |
| discussed in <a href="algorithms-classification.html#decision-trees">Decision Trees</a> and proposed |
| in <a href="algorithms-bibliography.html">[PandaHBB2009]</a>; the implementation details are similar to those of |
| the decision trees script. Below we review some features of our |
| implementation which differ from <code class="language-plaintext highlighter-rouge">decision-tree.dml</code>.</p> |
| |
| <p><strong>Bootstrapped sampling.</strong> Each decision tree is fitted to a |
| bootstrapped training set sampled with replacement (WR). To improve |
| efficiency, we generate $N$ sample counts according to a Poisson |
| distribution with parameter <code class="language-plaintext highlighter-rouge">subsamp_rate</code>, where $N$ |
| denotes the total number of training points. These sample counts |
| approximate WR sampling when $N$ is large enough and are generated |
| upfront for each decision tree.</p> |
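
<p>A hedged sketch of this sampling step (hypothetical names; assumes the
DML <code class="language-plaintext highlighter-rouge">rand</code> builtin with a
Poisson <code class="language-plaintext highlighter-rouge">pdf</code>):</p>

<div class="language-dml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch: approximate sampling with replacement via Poisson counts.
# counts[i] = number of times training point i enters this tree's sample.
counts = rand(rows=N, cols=1, pdf="poisson", lambda=subsamp_rate)
</code></pre></div></div>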
| |
| <p><strong>Bagging.</strong> Decision trees suffer from <em>high variance</em> |
resulting in different models whenever trained on random subsets of
| the data points. <em>Bagging</em> is a general-purpose method to |
| reduce the variance of a statistical learning method like decision |
| trees. In the context of decision trees (for classification), for a |
| given test feature vector the prediction is computed by taking a |
| <em>majority vote</em>: the overall prediction is the most |
| commonly occurring class among all the tree predictions.</p> |
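
<p>A minimal DML sketch of the majority vote (hypothetical names;
<code class="language-plaintext highlighter-rouge">predictions</code> is an
$n \times T$ matrix whose entry $(i,t)$ is the class predicted by tree $t$
for test record $i$):</p>

<div class="language-dml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hedged sketch: bagging by majority vote over per-tree predictions.
votes = matrix(0, rows=n, cols=k)
for (tree in 1:T) {
  votes = votes + table(seq(1, n), predictions[, tree], n, k)
}
yhat = rowIndexMax(votes)    # most frequently predicted class per record
</code></pre></div></div>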
| |
| <p><strong>Out-Of-Bag error estimation.</strong> Note that each bagged tree in a random |
| forest model is trained on a subset (around $\frac{2}{3}$) of the |
| observations (i.e., feature vectors). The remaining ($\frac{1}{3}$ of |
the) observations not used for training are called the
| <em>Out-Of-Bag</em> (<code class="language-plaintext highlighter-rouge">OOB</code>) observations. This gives us a |
| straightforward way to estimate the test error: to predict the class |
| label of each test observation $i$ we use the trees in which $i$ was |
| <code class="language-plaintext highlighter-rouge">OOB</code>. Our <code class="language-plaintext highlighter-rouge">random-forest-predict.dml</code> script provides the <code class="language-plaintext highlighter-rouge">OOB</code> |
| error estimate for a given training set if requested.</p> |
| |
| <p><strong>Description of the model.</strong> Similar to decision trees, the learned |
random forest model is represented in a matrix $M$ with at least <code class="language-plaintext highlighter-rouge">7</code> rows.
| The information stored in the model is similar to that of decision trees |
| with the difference that the tree-ids are stored in the second row and |
| rows $2,3,\ldots$ from the decision tree model are shifted by one. See |
| <a href="algorithms-classification.html#decision-trees">Decision Trees</a> for a description of the model.</p> |
| |
| <h3 id="random-forests-returns">Random Forests Returns</h3> |
| |
| <p>The matrix corresponding to the learned model is written to a file in |
| the format specified. See <a href="algorithms-classification.html#decision-trees">Decision Trees</a> where the |
details about the structure of the model matrix are described. Similar to
| <code class="language-plaintext highlighter-rouge">decision-tree.dml</code>, $X$ is split into $X_\text{cont}$ and |
| $X_\text{cat}$. If requested, the mappings of the continuous feature-ids |
| in $X_\text{cont}$ (stored at <code class="language-plaintext highlighter-rouge">S_map</code>) as well as the |
| categorical feature-ids in $X_\text{cat}$ (stored at |
| <code class="language-plaintext highlighter-rouge">C_map</code>) to the global feature-ids in $X$ will be provided. |
| The <code class="language-plaintext highlighter-rouge">random-forest-predict.dml</code> script may compute one or |
| more of predictions, accuracy, confusion matrix, and <code class="language-plaintext highlighter-rouge">OOB</code> error estimate |
| in the requested output format depending on the input arguments used.</p> |
| |
| </div> |
| <!-- |
| |
| --> |
| |
| |
| <!-- MathJax Section --> |
| <script type="text/x-mathjax-config"> |
| MathJax.Hub.Config({ TeX: { equationNumbers: { autoNumber: "AMS" } } }); |
| </script> |
| <script> |
| // Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS. |
| // We could use "//cdn.mathjax...", but that won't support "file://". |
| (function(d, script) { |
| script = d.createElement('script'); |
| script.type = 'text/javascript'; |
| script.async = true; |
| script.onload = function() { |
| MathJax.Hub.Config({ |
| tex2jax: { |
| inlineMath: [ |
| ["$", "$"], |
| ["\\\\(", "\\\\)"] |
| ], |
| displayMath: [ |
| ["$$", "$$"], |
| ["\\[", "\\]"] |
| ], |
| processEscapes: true, |
| skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'] |
| } |
| }); |
| }; |
| script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') + |
| 'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'; |
| d.getElementsByTagName('head')[0].appendChild(script); |
| }(document)); |
| </script> |
| |
| </body> |
| |
| </html> |