| \begin{comment} |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| \end{comment} |
| |
| \subsection{Random Forests} |
| \label{random_forests} |
| |
| \noindent{\bf Description} |
| \smallskip |
| |
| |
Random forest is one of the most successful machine learning methods for classification and regression.
It is an ensemble learning method that builds a model composed of a set of decision trees.
This implementation is well-suited to large-scale data and builds a random forest model for classification in parallel.\\
| |
| |
| \smallskip |
| \noindent{\bf Usage} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\it% |
| {\tt{}-f }path/\/{\tt{}random-forest.dml} |
| {\tt{} -nvargs} |
| {\tt{} X=}path/file |
| {\tt{} Y=}path/file |
| {\tt{} R=}path/file |
| {\tt{} bins=}integer |
| {\tt{} depth=}integer |
| {\tt{} num\_leaf=}integer |
| {\tt{} num\_samples=}integer |
| {\tt{} num\_trees=}integer |
| {\tt{} subsamp\_rate=}double |
| {\tt{} feature\_subset=}double |
| {\tt{} impurity=}Gini$\mid$entropy |
| {\tt{} M=}path/file |
| {\tt{} C=}path/file |
| {\tt{} S\_map=}path/file |
| {\tt{} C\_map=}path/file |
| {\tt{} fmt=}format |
| |
| } |
| |
| \smallskip |
| \noindent{\bf Usage: Prediction} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\it% |
| {\tt{}-f }path/\/{\tt{}random-forest-predict.dml} |
| {\tt{} -nvargs} |
| {\tt{} X=}path/file |
| {\tt{} Y=}path/file |
| {\tt{} R=}path/file |
| {\tt{} M=}path/file |
| {\tt{} C=}path/file |
| {\tt{} P=}path/file |
| {\tt{} A=}path/file |
| {\tt{} OOB=}path/file |
| {\tt{} CM=}path/file |
| {\tt{} fmt=}format |
| |
| }\smallskip |
| |
| |
| \noindent{\bf Arguments} |
| \begin{Description} |
| \item[{\tt X}:] |
| Location (on HDFS) to read the matrix of feature vectors; |
| each row constitutes one feature vector. Note that categorical features in $X$ need to be both recoded and dummy coded. |
| \item[{\tt Y}:] |
| Location (on HDFS) to read the matrix of (categorical) |
| labels that correspond to feature vectors in $X$. Note that classes are assumed to be both recoded and dummy coded. |
| This argument is optional for prediction. |
| \item[{\tt R}:] (default:\mbox{ }{\tt " "}) |
Location (on HDFS) to read matrix $R$ which for each feature in $X$ contains column-ids (first column), start indices (second column), and end indices (third column).
If $R$ is not provided, by default all features are assumed to be continuous-valued.
| \item[{\tt bins}:] (default:\mbox{ }{\tt 20}) |
Number of thresholds to choose for each continuous-valued feature (determined by equi-height binning; see the illustrative sketch following this list).
| \item[{\tt depth}:] (default:\mbox{ }{\tt 25}) |
Maximum depth of the learned trees in the random forest model.
| \item[{\tt num\_leaf}:] (default:\mbox{ }{\tt 10}) |
Parameter that controls pruning. The tree
is not expanded if a node receives fewer than {\tt num\_leaf} training examples.
| \item[{\tt num\_samples}:] (default:\mbox{ }{\tt 3000}) |
Parameter that decides when to switch to in-memory building of the subtrees in each tree of the random forest model.
If a node $v$ receives fewer than {\tt num\_samples}
training examples, then this implementation switches to an in-memory subtree
building procedure that builds the subtree rooted at $v$ in its entirety.
| \item[{\tt num\_trees}:] (default:\mbox{ }{\tt 10}) |
Number of trees to be learned in the random forest model.
| \item[{\tt subsamp\_rate}:] (default:\mbox{ }{\tt 1.0}) |
Parameter controlling the size of each tree in the random forest model; sample counts are generated from a Poisson distribution with parameter {\tt subsamp\_rate}.
| \item[{\tt feature\_subset}:] (default:\mbox{ }{\tt 0.5}) |
Parameter that controls the number of features used as split candidates at each tree node, expressed as a power of the number of features in the data, i.e., assuming the training set has $D$ features, $D^{\tt feature\_subset}$ features are used at each tree node. For example, the default value $0.5$ corresponds to using $\approx \sqrt{D}$ features at each node.
| \item[{\tt impurity}:] (default:\mbox{ }{\tt "Gini"}) |
Impurity measure used at internal nodes of the trees in the random forest model for selecting which features to split on. Possible values are {\tt entropy} and {\tt Gini}.
| \item[{\tt M}:] |
Location (on HDFS) to write matrix $M$ containing the learned random forest (see Section~\ref{sec:decision_trees} and below for the schema).
| \item[{\tt C}:] (default:\mbox{ }{\tt " "}) |
Location (on HDFS) to store the matrix of sample counts (generated according to a Poisson distribution with parameter {\tt subsamp\_rate}) for each feature vector. Note that this argument is optional. If the Out-Of-Bag (OOB) error estimate needs to be computed, this matrix is passed as input to {\tt random-forest-predict.dml}.
| \item[{\tt A}:] (default:\mbox{ }{\tt " "}) |
| Location (on HDFS) to store the testing accuracy (\%) from a |
| held-out test set during prediction. Note that this argument is optional. |
| \item[{\tt OOB}:] (default:\mbox{ }{\tt " "}) |
Location (on HDFS) to store the Out-Of-Bag (OOB) error estimate of the training set. Note that the matrix of sample counts (stored at {\tt C}) needs to be provided for computing the OOB error estimate, and that this argument is optional.
| \item[{\tt P}:] |
Location (on HDFS) to store predictions for a held-out test set.
| \item[{\tt CM}:] (default:\mbox{ }{\tt " "}) |
| Location (on HDFS) to store the confusion matrix computed using a held-out test set. Note that this argument is optional. |
| \item[{\tt S\_map}:] (default:\mbox{ }{\tt " "}) |
| Location (on HDFS) to write the mappings from the continuous-valued feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional. |
| \item[{\tt C\_map}:] (default:\mbox{ }{\tt " "}) |
| Location (on HDFS) to write the mappings from the categorical feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional. |
| \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"}) |
| Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}; |
| see read/write functions in SystemML Language Reference for details. |
| \end{Description} |
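
\smallskip
The following Python sketch is an illustration only, not the DML implementation; the names {\tt equi\_height\_thresholds} and {\tt x} are hypothetical. It shows how equi-height binning, as controlled by the {\tt bins} parameter, can derive candidate split thresholds for a continuous-valued feature by placing them at quantiles, so that each bin receives roughly the same number of training points; the exact threshold placement in {\tt random-forest.dml} may differ in detail.

\begin{verbatim}
import numpy as np

def equi_height_thresholds(x, bins=20):
    # Equi-height (quantile-based) binning: candidate thresholds are
    # interior quantiles; each bin gets roughly the same number of points.
    qs = np.arange(1, bins + 1) / (bins + 1.0)
    return np.unique(np.quantile(x, qs))

x = np.random.default_rng(0).normal(size=1000)  # one continuous feature
print(equi_height_thresholds(x, bins=20))  # candidate split thresholds
\end{verbatim}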
| |
| |
| \noindent{\bf Details} |
| \smallskip |
| |
| Random forests~\cite{Breiman01:rforest} are learning algorithms for ensembles of decision trees. |
The main idea is to build a number of decision trees on bootstrapped training samples, i.e., samples obtained by repeatedly drawing from a (single) training set.
Moreover, instead of considering all the features when building the trees, only a random subset of the features---typically $\approx \sqrt{D}$, where $D$ is the number of features---is chosen each time a split test at a tree node is performed.
This procedure {\it decorrelates} the trees and makes the resulting ensemble less prone to overfitting.
To build the individual decision trees we utilize the techniques proposed in~\cite{PandaHBB09:dtree} and discussed in Section~\ref{sec:decision_trees};
the implementation details are similar to those of the decision trees script.
Below we review the aspects of our implementation that differ from {\tt decision-tree.dml}.
| |
| |
| \textbf{Bootstrapped sampling.} |
| Each decision tree is fitted to a bootstrapped training set sampled with replacement (WR). |
| To improve efficiency, we generate $N$ sample counts according to a Poisson distribution with parameter {\tt subsamp\_rate}, |
| where $N$ denotes the total number of training points. |
| These sample counts approximate WR sampling when $N$ is large enough and are generated upfront for each decision tree. |
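
A minimal Python sketch of this idea (illustrative only; {\tt bootstrap\_counts} is a hypothetical name, not part of the DML scripts): one count is drawn per training example from a Poisson distribution with mean {\tt subsamp\_rate}, and example $i$ is treated as appearing {\tt counts[i]} times in the bootstrapped training set of a given tree.

\begin{verbatim}
import numpy as np

def bootstrap_counts(num_examples, subsamp_rate=1.0, seed=None):
    # One Poisson(subsamp_rate) count per training example; example i
    # appears counts[i] times in the (approximate) bootstrap sample.
    rng = np.random.default_rng(seed)
    return rng.poisson(lam=subsamp_rate, size=num_examples)

counts = bootstrap_counts(100000, subsamp_rate=1.0, seed=42)
# With subsamp_rate = 1.0, about exp(-1) (roughly 37%) of the examples
# receive a count of 0, i.e., they are out of bag for this tree.
print((counts == 0).mean())
\end{verbatim}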
| |
| |
| \textbf{Bagging.} |
Decision trees suffer from {\it high variance}, resulting in very different models whenever they are trained on random subsets of the data points.
| {\it Bagging} is a general-purpose method to reduce the variance of a statistical learning method like decision trees. |
| In the context of decision trees (for classification), for a given test feature vector |
| the prediction is computed by taking a {\it majority vote}: the overall prediction is the most commonly occurring class among all the tree predictions. |
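
As an illustration of the majority vote (a Python sketch with hypothetical names, not the code of {\tt random-forest-predict.dml}), given the per-tree class predictions for a set of test feature vectors, the ensemble prediction for each vector is its most frequently predicted class:

\begin{verbatim}
import numpy as np

def majority_vote(tree_predictions, num_classes):
    # tree_predictions: (num_examples, num_trees) matrix of recoded
    # class labels in 1..num_classes, one column per tree.
    # For each example, count votes per class and pick the most common one.
    votes = np.stack([np.bincount(row, minlength=num_classes + 1)
                      for row in tree_predictions])
    return votes.argmax(axis=1)

preds = np.array([[1, 2, 1],                # 2 test examples, 3 trees
                  [2, 2, 1]])
print(majority_vote(preds, num_classes=2))  # -> [1 2]
\end{verbatim}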
| |
| |
| \textbf{Out-Of-Bag error estimation.} |
| Note that each bagged tree in a random forest model is trained on a subset (around $\frac{2}{3}$) of the observations (i.e., feature vectors). |
The remaining ($\approx \frac{1}{3}$ of the) observations not used for training are called the {\it Out-Of-Bag} (OOB) observations.
This gives us a straightforward way to estimate the test error: to predict the class label of each training observation $i$ we use only the trees in which $i$ was OOB.
| Our {\tt random-forest-predict.dml} script provides the OOB error estimate for a given training set if requested. |
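
A Python sketch of the OOB estimate (illustrative only, with hypothetical names; the actual script reads the sample counts from the file passed as {\tt C}): for each training example, only the trees whose Poisson count for that example is zero take part in the vote.

\begin{verbatim}
import numpy as np

def oob_error(tree_predictions, counts, labels):
    # tree_predictions: (N, num_trees) per-tree class predictions
    # counts:           (N, num_trees) per-tree Poisson sample counts
    # labels:           (N,) true (recoded) class labels
    errors, used = 0, 0
    for i in range(labels.shape[0]):
        oob_trees = np.where(counts[i] == 0)[0]  # trees that never saw i
        if oob_trees.size == 0:
            continue                             # in-bag for every tree
        votes = np.bincount(tree_predictions[i, oob_trees])
        errors += int(votes.argmax() != labels[i])
        used += 1
    return errors / max(used, 1)
\end{verbatim}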
| |
| |
| \textbf{Description of the model.} |
Similar to decision trees, the learned random forest model is represented by a matrix $M$ with at least 7 rows.
The information stored in the model is similar to that of the decision tree model, with the difference that the tree-ids are stored
in the second row and rows $2,3,\ldots$ of the decision tree model are shifted down by one. See Section~\ref{sec:decision_trees} for a description of the model.
| |
| |
| \smallskip |
| \noindent{\bf Returns} |
| \smallskip |
| |
| |
The matrix corresponding to the learned model is written to a file in the specified format. See Section~\ref{sec:decision_trees} for details about the structure of the model matrix.
| Similar to {\tt decision-tree.dml}, $X$ is split into $X_\text{cont}$ and $X_\text{cat}$. |
| If requested, the mappings of the continuous feature-ids in $X_\text{cont}$ (stored at {\tt S\_map}) as well as the categorical feature-ids in $X_\text{cat}$ (stored at {\tt C\_map}) to the global feature-ids in $X$ will be provided. |
| The {\tt random-forest-predict.dml} script may compute one or more of |
| predictions, accuracy, confusion matrix, and OOB error estimate in the requested output format depending on the input arguments used. |
| |
| |
| |
| \smallskip |
| \noindent{\bf Examples} |
| \smallskip |
| |
| {\hangindent=\parindent\noindent\tt |
| \hml -f random-forest.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx |
| R=/user/biadmin/R.csv M=/user/biadmin/model.csv |
| bins=20 depth=25 num\_leaf=10 num\_samples=3000 num\_trees=10 impurity=Gini fmt=csv |
| |
| }\smallskip |
| |
| |
| \noindent To compute predictions: |
| |
| {\hangindent=\parindent\noindent\tt |
| \hml -f random-forest-predict.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx R=/user/biadmin/R.csv |
| M=/user/biadmin/model.csv P=/user/biadmin/predictions.csv |
| A=/user/biadmin/accuracy.csv CM=/user/biadmin/confusion.csv fmt=csv |
| |
| }\smallskip |
| |
| |
| %\noindent{\bf References} |
| % |
| %\begin{itemize} |
| %\item B. Panda, J. Herbach, S. Basu, and R. Bayardo. \newblock{PLANET: massively parallel learning of tree ensembles with MapReduce}. In Proceedings of the VLDB Endowment, 2009. |
| %\item L. Breiman. \newblock{Random Forests}. Machine Learning, 45(1), 5--32, 2001. |
| %\end{itemize} |