\begin{comment}
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
\end{comment}
\subsection{Random Forests}
\label{random_forests}
\noindent{\bf Description}
\smallskip
Random forest is one of the most successful machine learning methods for classification and regression.
It is an ensemble learning method that creates a model composed of a set of tree models.
This implementation is well-suited to handle large-scale data and builds a random forest model for classification in parallel.\\
\smallskip
\noindent{\bf Usage}
\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}random-forest.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} R=}path/file
{\tt{} bins=}integer
{\tt{} depth=}integer
{\tt{} num\_leaf=}integer
{\tt{} num\_samples=}integer
{\tt{} num\_trees=}integer
{\tt{} subsamp\_rate=}double
{\tt{} feature\_subset=}double
{\tt{} impurity=}Gini$\mid$entropy
{\tt{} M=}path/file
{\tt{} C=}path/file
{\tt{} S\_map=}path/file
{\tt{} C\_map=}path/file
{\tt{} fmt=}format
}
\smallskip
\noindent{\bf Usage: Prediction}
\smallskip
{\hangindent=\parindent\noindent\it%
{\tt{}-f }path/\/{\tt{}random-forest-predict.dml}
{\tt{} -nvargs}
{\tt{} X=}path/file
{\tt{} Y=}path/file
{\tt{} R=}path/file
{\tt{} M=}path/file
{\tt{} C=}path/file
{\tt{} P=}path/file
{\tt{} A=}path/file
{\tt{} OOB=}path/file
{\tt{} CM=}path/file
{\tt{} fmt=}format
}\smallskip
\noindent{\bf Arguments}
\begin{Description}
\item[{\tt X}:]
Location (on HDFS) to read the matrix of feature vectors;
each row constitutes one feature vector. Note that categorical features in $X$ need to be both recoded and dummy coded.
\item[{\tt Y}:]
Location (on HDFS) to read the matrix of (categorical)
labels that correspond to feature vectors in $X$. Note that classes are assumed to be both recoded and dummy coded.
This argument is optional for prediction.
\item[{\tt R}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to read matrix $R$, which for each feature in $X$ contains column-ids (first column), start indices (second column), and end indices (third column).
If $R$ is not provided, by default all features are assumed to be continuous-valued.
\item[{\tt bins}:] (default:\mbox{ }{\tt 20})
Number of thresholds to choose for each continuous-valued feature (determined by equi-height binning).
\item[{\tt depth}:] (default:\mbox{ }{\tt 25})
Maximum depth of the learned trees in the random forest model.
\item[{\tt num\_leaf}:] (default:\mbox{ }{\tt 10})
Parameter that controls pruning. The tree
is not expanded if a node receives less than {\tt num\_leaf} training examples.
\item[{\tt num\_samples}:] (default:\mbox{ }{\tt 3000})
Parameter that decides when to switch to in-memory building of the subtrees in each tree of the random forest model.
If a node $v$ receives less than {\tt num\_samples}
training examples then this implementation switches to an in-memory subtree
building procedure to build the subtree under $v$ in its entirety.
\item[{\tt num\_trees}:] (default:\mbox{ }{\tt 10})
Number of trees to be learned in the random forest model.
\item[{\tt subsamp\_rate}:] (default:\mbox{ }{\tt 1.0})
Parameter controlling the size of each tree in the random forest model; samples are selected from a Poisson distribution with parameter {\tt subsamp\_rate}.
\item[{\tt feature\_subset}:] (default:\mbox{ }{\tt 0.5})
Parameter that controls the number of features used as split candidates at each tree node, expressed as a power of the number of features in the data, i.e., assuming the training set has $D$ features, $D^{\tt feature\_subset}$ features are used at each tree node (see the sketch following this argument list for a small worked example).
\item[{\tt impurity}:] (default:\mbox{ }{\tt "Gini"})
Impurity measure used at internal nodes of the trees in the random forest model for selecting which features to split on. Possible values are {\tt entropy} or {\tt Gini}.
\item[{\tt M}:]
Location (on HDFS) to write matrix $M$ containing the learned random forest (see Section~\ref{sec:decision_trees} and below for the schema).
\item[{\tt C}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to store the sample counts (generated according to a Poisson distribution with parameter {\tt subsamp\_rate}) for each feature vector. Note that this argument is optional. If the Out-Of-Bag (OOB) error estimate needs to be computed, this matrix is passed as input to {\tt random-forest-predict.dml}.
\item[{\tt A}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to store the testing accuracy (\%) from a
held-out test set during prediction. Note that this argument is optional.
\item[{\tt OOB}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to store the Out-Of-Bag (OOB) error estimate of the training set. The matrix of sample counts (stored at {\tt C}) needs to be provided for computing the OOB error estimate. Note that this argument is optional.
\item[{\tt P}:]
Location (on HDFS) to store predictions for a held-out test set.
\item[{\tt CM}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to store the confusion matrix computed using a held-out test set. Note that this argument is optional.
\item[{\tt S\_map}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to write the mappings from the continuous-valued feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
\item[{\tt C\_map}:] (default:\mbox{ }{\tt " "})
Location (on HDFS) to write the mappings from the categorical feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
see read/write functions in SystemML Language Reference for details.
\end{Description}
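The role of {\tt feature\_subset} can be illustrated with a minimal Python sketch (a standalone example, not DML and not part of the SystemML scripts); the helper {\tt candidate\_features} and the example values are hypothetical:
\begin{verbatim}
# Illustrative sketch only (Python/NumPy); hypothetical helper, not SystemML code.
import numpy as np

def candidate_features(D, feature_subset=0.5, seed=None):
    """Draw the feature-ids considered at one tree node: D^feature_subset out of D."""
    k = int(np.ceil(D ** feature_subset))     # e.g., D=100, 0.5 -> 10 = sqrt(D)
    rng = np.random.default_rng(seed)
    return rng.choice(D, size=k, replace=False)

print(candidate_features(100, 0.5, seed=42))  # 10 random feature-ids out of 100
\end{verbatim}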
\noindent{\bf Details}
\smallskip
Random forests~\cite{Breiman01:rforest} are learning algorithms for ensembles of decision trees.
The main idea is to build a number of decision trees on bootstrapped training samples, i.e., by repeatedly drawing samples from a (single) training set.
Moreover, instead of considering all the features when building the trees, only a random subset of the features---typically $\approx \sqrt{D}$, where $D$ is the number of features---is chosen each time a split test at a tree node is performed.
This procedure {\it decorrelates} the trees and makes the ensemble less prone to overfitting.
To build the individual decision trees we utilize the techniques proposed in~\cite{PandaHBB09:dtree} and discussed in Section~\ref{sec:decision_trees};
the implementation details are similar to those of the decision trees script.
Below we review some aspects of our implementation which differ from {\tt decision-tree.dml}.

\textbf{Bootstrapped sampling.}
Each decision tree is fitted to a bootstrapped training set sampled with replacement (WR).
To improve efficiency, we generate $N$ sample counts according to a Poisson distribution with parameter {\tt subsamp\_rate},
where $N$ denotes the total number of training points.
These sample counts approximate WR sampling when $N$ is large enough and are generated upfront for each decision tree.
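To make this concrete, the following minimal Python sketch (a standalone illustration under assumed values, not the DML implementation) generates such Poisson sample counts for one tree; with {\tt subsamp\_rate} $= 1.0$, their sum is close to $N$ and a fraction $e^{-1} \approx 1/3$ of the examples receive a count of zero:
\begin{verbatim}
# Illustrative Python sketch of the Poisson-based bootstrap; not SystemML code.
import numpy as np

N, subsamp_rate = 100000, 1.0
rng = np.random.default_rng(0)

# One count per training example; count k means the example appears k times
# in the bootstrapped training set of this tree (0 means it is out-of-bag).
counts = rng.poisson(lam=subsamp_rate, size=N)

print(counts.sum())          # approx. N * subsamp_rate examples per tree
print(np.mean(counts == 0))  # approx. exp(-1) ~ 0.37 of the examples are out-of-bag
\end{verbatim}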

\textbf{Bagging.}
Decision trees suffer from {\it high variance}, resulting in different models whenever they are trained on random subsets of the data points.
{\it Bagging} is a general-purpose method to reduce the variance of a statistical learning method like decision trees.
In the context of decision trees (for classification), for a given test feature vector
the prediction is computed by taking a {\it majority vote}: the overall prediction is the most commonly occurring class among all the tree predictions.
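This voting rule can be sketched in a few lines of Python (an illustrative standalone example with hypothetical variable names, not the prediction script itself):
\begin{verbatim}
# Illustrative Python sketch of bagging by majority vote; not SystemML code.
import numpy as np

# tree_preds[t, i] = class label predicted by tree t for test example i
tree_preds = np.array([[1, 2, 2],
                       [1, 2, 1],
                       [2, 2, 1]])

def majority_vote(tree_preds):
    # For each test example, pick the most frequent class among the tree predictions.
    return np.array([np.bincount(col).argmax() for col in tree_preds.T])

print(majority_vote(tree_preds))   # -> [1 2 1]
\end{verbatim}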

\textbf{Out-Of-Bag error estimation.}
Note that each bagged tree in a random forest model is trained on a subset (around $\frac{2}{3}$) of the observations (i.e., feature vectors).
The remaining ($\frac{1}{3}$ of the) observations not used for training are called the {\it Out-Of-Bag} (OOB) observations.
This gives us a straightforward way to estimate the test error: to predict the class label of each test observation $i$ we use the trees in which $i$ was OOB.
Our {\tt random-forest-predict.dml} script provides the OOB error estimate for a given training set if requested.
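A minimal Python sketch of this estimate, assuming the per-tree Poisson sample counts described above and per-tree predictions for each training example (all names and toy values are hypothetical), could look as follows:
\begin{verbatim}
# Illustrative Python sketch of the Out-Of-Bag (OOB) error estimate; not SystemML code.
import numpy as np

def oob_error(counts, tree_preds, y):
    """counts[t, i]: Poisson sample count of example i in tree t (0 => OOB);
       tree_preds[t, i]: label predicted by tree t for example i;
       y[i]: true label of example i."""
    n_err, n_scored = 0, 0
    for i in range(len(y)):
        oob_trees = np.where(counts[:, i] == 0)[0]   # trees that did not see example i
        if oob_trees.size == 0:
            continue                                  # example was in-bag in every tree
        vote = np.bincount(tree_preds[oob_trees, i]).argmax()
        n_err += int(vote != y[i])
        n_scored += 1
    return n_err / max(n_scored, 1)

# Toy usage: 3 trees, 4 training examples, binary labels {0, 1}.
counts     = np.array([[0, 1, 2, 0], [1, 0, 1, 1], [2, 1, 0, 0]])
tree_preds = np.array([[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 1]])
y          = np.array([1, 0, 0, 1])
print(oob_error(counts, tree_preds, y))               # -> 0.25
\end{verbatim}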

\textbf{Description of the model.}
Similar to decision trees, the learned random forest model is presented in a matrix $M$ with at least 7 rows.
The information stored in the model is similar to that of decision trees, with the difference that the tree-ids are stored
in the second row and rows $2,3,\ldots$ of the decision tree model are shifted by one. See Section~\ref{sec:decision_trees} for a description of the model.
\smallskip
\noindent{\bf Returns}
\smallskip
The matrix corresponding to the learned model is written to a file in the format specified. See Section~\ref{sec:decision_trees}, where the structure of the model matrix is described in detail.
Similar to {\tt decision-tree.dml}, $X$ is split into $X_\text{cont}$ and $X_\text{cat}$.
If requested, the mappings of the continuous feature-ids in $X_\text{cont}$ (stored at {\tt S\_map}) as well as the categorical feature-ids in $X_\text{cat}$ (stored at {\tt C\_map}) to the global feature-ids in $X$ will be provided.
The {\tt random-forest-predict.dml} script may compute one or more of
predictions, accuracy, confusion matrix, and OOB error estimate in the requested output format depending on the input arguments used.
\smallskip
\noindent{\bf Examples}
\smallskip
{\hangindent=\parindent\noindent\tt
\hml -f random-forest.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
R=/user/biadmin/R.csv M=/user/biadmin/model.csv
bins=20 depth=25 num\_leaf=10 num\_samples=3000 num\_trees=10 impurity=Gini fmt=csv
}\smallskip
\noindent To compute predictions:
{\hangindent=\parindent\noindent\tt
\hml -f random-forest-predict.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx R=/user/biadmin/R.csv
M=/user/biadmin/model.csv P=/user/biadmin/predictions.csv
A=/user/biadmin/accuracy.csv CM=/user/biadmin/confusion.csv fmt=csv
}\smallskip
%\noindent{\bf References}
%
%\begin{itemize}
%\item B. Panda, J. Herbach, S. Basu, and R. Bayardo. \newblock{PLANET: massively parallel learning of tree ensembles with MapReduce}. In Proceedings of the VLDB Endowment, 2009.
%\item L. Breiman. \newblock{Random Forests}. Machine Learning, 45(1), 5--32, 2001.
%\end{itemize}